Media Technologies Lab
Texas Instruments Incorporated
P.O. Box 655303, MS 8374, Dallas, TX 75265
[rajeev | yeshwant | vishu]@csc.ti.com
There is a vast amount of information on the World Wide Web, at the fingertips of anyone with Internet access. However, so far this information has primarily been used by people who connect to the web via a traditional computer. This is about to change. Recent advances in wireless communication, speech recognition, and speech synthesis technologies have made it possible to access this information from any place, at any time, using only a cellular phone. Possible applications include browsing the web, getting stock quotes, verifying flight schedules, obtaining maps and directions, and checking E-mail. In this paper, we discuss different types of web-based applications, briefly describe our system architecture with examples of applications we have developed, and discuss some of the key issues in building spoken dialog applications for the web.
In this paper, we outline different categories of applications for voice browsers and present suggestions that should be helpful in setting standards for voice browsers. We also give an overview of our work in this area and discuss some design issues that are important in building portable, usable, and robust spoken interfaces for web-based applications.
Web Browsing: Such applications let the user go to any page on the web and browse it by speaking the links on that page rather than clicking on them. The grammars for speech recognition can be created dynamically by parsing each web page as it is accessed. Such applications may or may not provide audio feedback to the user.
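The dynamic grammar creation step can be sketched as follows: extract the anchor text of each link on the page and emit a JSGF-style rule whose alternatives are those link names. This is a minimal illustration only; the regular expression is a simplification of real HTML parsing, and a production voice browser would reuse its own parse tree.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch: build a JSGF-style grammar of spoken link names from an HTML page.
public class LinkGrammarBuilder {
    // Simplified anchor pattern, for illustration only.
    private static final Pattern LINK =
        Pattern.compile("<a\\s+[^>]*href=\"([^\"]*)\"[^>]*>([^<]+)</a>",
                        Pattern.CASE_INSENSITIVE);

    // Collect the visible text of each link on the page.
    public static List<String> extractLinkNames(String html) {
        List<String> names = new ArrayList<>();
        Matcher m = LINK.matcher(html);
        while (m.find()) {
            names.add(m.group(2).trim());
        }
        return names;
    }

    // Emit one public rule whose alternatives are the link names.
    public static String toJsgf(List<String> names) {
        StringBuilder sb =
            new StringBuilder("#JSGF V1.0;\ngrammar links;\npublic <link> = ");
        for (int i = 0; i < names.size(); i++) {
            if (i > 0) sb.append(" | ");
            sb.append(names.get(i));
        }
        return sb.append(";").toString();
    }

    public static void main(String[] args) {
        String page = "<html><a href=\"/news\">Latest News</a>"
                    + "<a href=\"/quotes\">Stock Quotes</a></html>";
        System.out.println(toJsgf(extractLinkNames(page)));
        // #JSGF V1.0;
        // grammar links;
        // public <link> = Latest News | Stock Quotes;
    }
}
```

The resulting grammar string can then be handed to the recognizer each time a new page is loaded.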
Limited Information Access: This category includes applications that can provide useful information in limited domains. Good examples of such applications are querying about the weather in a city, checking stock quotes, etc. The user may ask for the required information in a few different ways, as allowed by the recognition grammar. The result is spoken back to the user in a format fixed by the text generation rules for that scenario. Such applications may be implemented in a couple of ways:
Using Embedded Grammars: It is possible to embed the grammars for the speech recognizer, as well as the tags used for audio output, in the HTML of certain "smart" pages created specifically for the application. Note that the HTML may either embed the grammars and synthesis rules directly or contain pointers to the files containing them.
Using Java Applets: One may also write speech-enabled Java applets to perform the same task. The Java Speech API [JSAPI, 1998] specifies the interface between Java applets and speech recognition and synthesis engines. The speech recognition grammars and the text generation rules are incorporated within the applet.
Spoken Dialog Systems: The web-based spoken dialog systems are usually designed using a client-server architecture. The web page could be running a Java applet that acts as a client, which connects to a remote server that does the bulk of the work. In such systems, the grammars for the speech recognizer are loaded from the server. Further, the server controls the client's interaction with the user, thereby obviating the need for any embedded tags for either the recognition grammars or text generation rules. Examples of such systems include remote access to E-mail/Voice mail, finding certain businesses near the user's location, getting maps/directions for any place, etc.
The limited information access applications take voice browsers one step further by allowing the user to provide input for speech-based form-filling applications. Once the user has specified the values of certain fields in a form, the information from that page may be used to query either a CGI script or a Java servlet to access the desired information. Although this is a very useful feature, there is a limited set of tasks that one can accomplish with the small, fixed grammars of these systems. For most "real" applications, it is important to provide the user with a natural interface, and this requires a spoken dialog system. We believe this category offers the largest number of possible information access applications. We have done considerable work both in building Java applet-based limited information access systems and in building spoken dialog systems.
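The query construction step might look like the following sketch, where the CGI script name and the field names are hypothetical, chosen only to illustrate how recognized form values are URL-encoded into a request:

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch: turn filled form fields into a CGI query string.
// The script path and parameter names are illustrative assumptions.
public class FormQuery {
    public static String buildQuery(String script, Map<String, String> fields) {
        StringBuilder sb = new StringBuilder(script).append("?");
        boolean first = true;
        for (Map.Entry<String, String> e : fields.entrySet()) {
            try {
                if (!first) sb.append("&");
                sb.append(URLEncoder.encode(e.getKey(), "UTF-8"))
                  .append("=")
                  .append(URLEncoder.encode(e.getValue(), "UTF-8"));
            } catch (UnsupportedEncodingException ex) {
                throw new IllegalStateException(ex); // UTF-8 is always available
            }
            first = false;
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("city", "San Francisco");
        fields.put("day", "tomorrow");
        System.out.println(buildQuery("/cgi-bin/weather", fields));
        // /cgi-bin/weather?city=San+Francisco&day=tomorrow
    }
}
```

The same string could just as easily be posted to a Java servlet; only the endpoint changes.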
In the next section, we present some suggestions for modifications/extensions to HTML4 or CSS2 that may help in setting standards for voice browsers. Some of these suggestions may already be well understood by some readers, but we include them here for the sake of completeness. In Section 4, we provide an overview of our work in developing several prototype spoken dialog systems. Finally, in Section 5, we discuss additional design issues that are important in building such systems.
For error handling, Raggett and Ben-Natan [Raggett and Ben-Natan, 1998] suggest two possible handlers, OnSelectionTimeout and OnSelectionError. We suggest the inclusion of two additional error handlers, OnRejectionError and OnHelp, which are described as follows:
OnRejectionError: Since what the user is allowed to say at any time is limited by the state of the system, it is possible that some user utterances may be unexpected for the system and are likely to get misrecognized. Many recognizers (including ours) incorporate score-based rejection, so that if the system does not have enough confidence in its recognition, it asks the user to repeat the utterance rather than trying to process whatever it recognized. The OnRejectionError handler will specify what should be said to the user in case the recognizer rejects his/her utterance.
OnHelp: The user may sometimes get confused about what he/she is supposed to say to the system. It would be nice to embed a help message within the HTML so that whenever the user asks for help (by saying anything in a standard help grammar, such as "help", or "what can I say", etc.), the voice browser can provide it. Note that this feature would be more useful for smaller applications; in larger spoken dialog systems, it becomes essential to provide context-sensitive help which is often provided by the server that understands the state of the dialog.
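The score-based rejection behind OnRejectionError amounts to a confidence-threshold test. The following is a minimal sketch of that control flow; the threshold value and prompt text are illustrative assumptions, not part of any specification or of our recognizer's actual interface:

```java
// Sketch: the confidence of the best hypothesis is compared against a
// threshold; below it, the OnRejectionError prompt is spoken instead of
// acting on the hypothesis. Threshold and prompt are illustrative.
public class RejectionFilter {
    private final double threshold;
    private final String rejectionPrompt;

    public RejectionFilter(double threshold, String rejectionPrompt) {
        this.threshold = threshold;
        this.rejectionPrompt = rejectionPrompt;
    }

    // Returns the hypothesis to act on, or the rejection prompt to play back.
    public String respond(String hypothesis, double confidence) {
        if (confidence < threshold) {
            return rejectionPrompt;
        }
        return hypothesis;
    }

    public static void main(String[] args) {
        RejectionFilter f =
            new RejectionFilter(0.5, "Sorry, I didn't catch that. Please repeat.");
        System.out.println(f.respond("check stock quotes", 0.92));
        // check stock quotes
        System.out.println(f.respond("mumbled input", 0.31));
        // Sorry, I didn't catch that. Please repeat.
    }
}
```

In a browser, the rejected branch would invoke whatever prompt the page author supplied in the OnRejectionError handler.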
There appears to be no standard way to specify speech grammars at present. Although Cascading Style Sheets can be used to specify the feedback string to be sent to the speech synthesizer, there is no equivalent format for speech input. None of the recognized media descriptors lends itself nicely to speech recognition. We propose the addition of a "speech" media descriptor that would enable the specification of the speech recognition grammar using a STYLE statement such as:
<STYLE type="text/grammar" media="speech">
The formats for the specification of these speech grammars have not yet been standardized. We propose that a subset of the Java Speech Grammar Format (JSGF) [JSGF, 1998] be used for this purpose; it is a good standard that has already been adopted by speech researchers.
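As an illustration of the proposed mechanism, a page could embed a small weather-query grammar inside such a STYLE element. The JSGF fragment and the city list below are hypothetical examples, not part of any standard:

```html
<STYLE type="text/grammar" media="speech">
  #JSGF V1.0;
  grammar weather;
  public <query> = [tell me] the weather (in | for) <city>;
  <city> = Dallas | Houston | New York;
</STYLE>
```

A voice browser encountering this element would load the grammar into its recognizer for the duration of the page, exactly as it loads a style sheet for rendering.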
The specification of text generation rules for any application also needs to be standardized. One example is the use of "cue-before" and "cue-after" tags, as outlined in [Raggett and Ben-Natan, 1998]. The generated text can be annotated with additional markers to assist the TTS in properly pronouncing it. The annotation format may be adapted from the Java Speech Markup Language (JSML) [JSML, 1998]. JSML is based on XML and has a standardized, extensible syntax that is not tied to the Java language. Further, any JSML document can easily be converted to a well-formed XML document by adding a single declaration line at the top.
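For instance, retrieved text could be annotated before being sent to the TTS as follows. The element names are taken from the JSML beta draft; the exact attribute names and values shown here are illustrative assumptions, and the declaration on the first line is the "one line" that makes the document well-formed XML:

```xml
<?xml version="1.0"?>
<JSML>
  Your flight <SAYAS class="digits">1372</SAYAS> departs at
  <SAYAS class="time">8:35</SAYAS> from gate <EMP>twelve</EMP>.
  <BREAK/>
  <PROS rate="-20%">Please check in one hour early.</PROS>
</JSML>
```

Here the SAYAS elements tell the synthesizer how to read numbers, EMP adds emphasis, and PROS slows the prosody for the final instruction.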
We suggest that speech recognizers integrated within a voice browser should have the ability to dynamically load new grammars at runtime and to switch between grammars that have already been loaded. In the web browsing application, this allows the recognizer to go to any previously unseen page on the web and dynamically generate and load the grammars for all links on that page. In the example described in [Raggett and Ben-Natan, 1998], if the user's previous query resulted in four options, the user can be prompted to speak any of the options rather than prompting "Say 1 for Company Info, 2 for Latest News, 3 for Placing an Order,...". Another example would be that of an E-mail system, where the names of all the senders of E-mail are dynamically loaded after E-mail is checked, so as to enable the user to then ask for E-mail from specific people. Although it is certainly possible to accomplish all this with a large vocabulary speech recognizer as well, the dynamic grammar capability is useful in vastly improving the recognition accuracy.
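The control flow for dynamic grammar loading and switching can be sketched as follows. A real recognizer would compile each grammar; here a grammar is represented only by its JSGF text, which is enough to illustrate the idea, and the grammar names and sender names are hypothetical:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch: a registry of named grammars that can be loaded at runtime and
// switched between. Grammar compilation is recognizer-specific and omitted.
public class GrammarRegistry {
    private final Map<String, String> grammars = new HashMap<>();
    private String active;

    public void load(String name, String jsgfText) {
        grammars.put(name, jsgfText);
    }

    // E.g., after checking mail, build a sender grammar from the inbox so the
    // user can then ask for mail from specific people.
    public void loadSenders(List<String> senders) {
        load("senders", "#JSGF V1.0;\ngrammar senders;\npublic <sender> = "
                        + String.join(" | ", senders) + ";");
    }

    public void activate(String name) {
        if (!grammars.containsKey(name)) {
            throw new IllegalArgumentException("grammar not loaded: " + name);
        }
        active = name;
    }

    public String activeGrammar() {
        return grammars.get(active);
    }

    public static void main(String[] args) {
        GrammarRegistry r = new GrammarRegistry();
        r.loadSenders(java.util.Arrays.asList("John Smith", "Mary Jones"));
        r.activate("senders");
        System.out.println(r.activeGrammar());
    }
}
```

In the four-option example above, the browser would load a grammar built from the option labels and activate it in place of a "say 1, 2, 3" prompt.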
Some of the sample applications we have developed are described next. These include getting flight, weather, and stock information from an InfoPhone, remote access to E-mail/voice mail, voice navigation to obtain maps and directions to different locations, and a business locator service to find businesses near the user's current location. The InfoPhone application belongs to the second category (limited information access), while the others fall under the category of spoken dialog systems.
InfoPhone: Although the mobile user may have access, via radio broadcasts, to useful information such as stock quotes, flight schedules, and weather forecasts, this information is (i) not customized to each user, and (ii) only available when he is in the car or otherwise has access to a radio. Our InfoPhone prototype system attempts to solve this problem by providing easy access to information sources on the web from the user's cellular phone. The InfoPhone is currently a speech-enabled Java applet that simulates a cellular phone. Users choose flights, stocks, or weather from a top-level menu and interact with each "service" by speech commands. "Keypad" (non-speech) input is also available as a backup. The applet incorporates separate grammars for company names (for stocks), flight numbers, and city names (for weather). We envision these grammars being customized for each user when he signs up for the system. The dynamic grammar switching capability of our speech recognizer allows the user to switch between these grammars on the fly. Speech input to the applet is processed by the recognizer, and the information request is sent to a server that accesses the appropriate web site, retrieves the HTML page, extracts just the essential information, and transmits it to the applet. The results are played out by the TTS system and displayed on the small "phone display", so the user is not forced to look at the display for the information. We expect the InfoPhone system to be a valuable information retrieval tool for the mobile user.
Remote E-mail/Voice Mail Access: The Voice E-mail system has a client-server architecture and is completely voice-driven. Users talk to the system and listen to messages and prompts played back by the speech synthesizer. The system has a minimal display (for status messages) and is designed to operate primarily in a "displayless" mode, where the user can effectively interact with the system without looking at a display. For people who prefer a display in non-driving conditions, an optional display can be incorporated into this system. The server handles all of the e-mail/voice-mail functions. It accesses the e-mail and voice-mail servers and handles the receiving, sending, and storage of messages. It communicates with the client via sockets. The client provides the user interface and handles the reading, navigation, categorization, and filtering of e-mail and voice-mail messages. It has both speech recognition and TTS capabilities and does not maintain a constant connection to the server (to reduce connection time charges). It connects to the server only to initiate or end a session, check for new mail, or send a message. An important aspect of the displayless user interface is that the user should, at all times, know exactly what to do, or should be able to find out easily. To this end, we have incorporated an elaborate context-dependent help feature. If the user gets lost, he also has the ability to reset all changes and start over from the beginning. The current system is an extension of previous collaborative work with MIT [Marx, 1995] and handles reading, filtering, categorization, and navigation of e-mail messages. It will soon incorporate voice-mail send and receive (using Caller ID information) and, later, the capability to "compose" and send e-mail (for example, using speech-based form-filling).
Voice Navigation: We have developed an application to obtain maps and/or directions for different places in a city as naturally as possible, by voice I/O alone. This navigation system is primarily aimed at hands-busy, eyes-busy conditions such as automobile driving. An optional display is provided for situations where the user may safely look at the screen, for example when the car is parked. All textual feedback is spoken back to the user using speech synthesis. A dialog manager handles all interactions with the user. The GPS receiver, the speech recognizer, and the speech synthesizer reside on the navigation client. The navigation server interacts with the MapQuest web site (www.mapquest.com), which acts as the map server. The user may ask for maps and/or directions to specific addresses in a city or to certain points of interest in that city. Further, any displayed map may be zoomed to different levels or moved in any direction using voice commands. As an example, the user may say "give me directions to 8330 LBJ Freeway" or "how do I get to the Doubletree Hotel?" and get directions to these places. If the system knows about more than one hotel with the same name, the dialog manager will first ask the user to resolve that ambiguity before attempting to retrieve the map.
Business Locator: It is often the case that mobile users do not want to go to a specific address or point of interest, but just want to find some business near their current location. We have developed a system that incorporates about 30 different yellow pages categories (such as "Chinese restaurant", "gas station", "towing service", "locksmith", and "convenience store"). Users may make queries like "Find the nearest Chinese restaurant" and get the name, address, and phone number for it. Further, this service has been integrated with the voice navigation application so that once nearby businesses are found, the user may also get maps and/or directions to them, if needed. We use the GTE SuperPages web site (www.superpages.com) as our yellow pages server.
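The "nearest" selection can be sketched as a great-circle distance comparison against the user's GPS position. The business names and coordinates below are hypothetical; this is an illustration of the computation, not our system's actual implementation:

```java
// Sketch: choose the nearest business by great-circle (haversine) distance
// from the user's GPS position. All data below is made up for illustration.
public class BusinessLocator {
    static final double EARTH_RADIUS_KM = 6371.0;

    // Haversine great-circle distance between two lat/lon points, in km.
    static double distanceKm(double lat1, double lon1, double lat2, double lon2) {
        double dLat = Math.toRadians(lat2 - lat1);
        double dLon = Math.toRadians(lon2 - lon1);
        double a = Math.sin(dLat / 2) * Math.sin(dLat / 2)
                 + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
                 * Math.sin(dLon / 2) * Math.sin(dLon / 2);
        return 2 * EARTH_RADIUS_KM * Math.asin(Math.sqrt(a));
    }

    // names[i] is located at (lats[i], lons[i]); return the nearest name.
    static String nearest(double userLat, double userLon,
                          String[] names, double[] lats, double[] lons) {
        int best = 0;
        for (int i = 1; i < names.length; i++) {
            if (distanceKm(userLat, userLon, lats[i], lons[i])
                < distanceKm(userLat, userLon, lats[best], lons[best])) {
                best = i;
            }
        }
        return names[best];
    }

    public static void main(String[] args) {
        String[] names = { "Gas Station A", "Gas Station B" };
        double[] lats = { 32.90, 32.78 };
        double[] lons = { -96.77, -96.80 };
        System.out.println(nearest(32.79, -96.80, names, lats, lons));
        // Gas Station B
    }
}
```

In the deployed service the candidate list comes back from the yellow pages server, so only the final comparison happens on the client.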
In Section 3, we described some of the issues related to the proper integration of a speech recognizer and a speech synthesizer into a browser. We now present additional design issues related to the development of better speech applications for the web. We have already incorporated some of these issues into our application design, as discussed below.
We believe that there will be an explosion of speech applications for the web in the next few years. It is important that some standards be set now so that all voice browsers can provide at least some features in a consistent manner. We have also discussed additional design issues which, although not directly related to specification of HTML4/CSS2 standards, are nevertheless important to consider when building such applications.
[Hemphill and Thrift, 1995] "Surfing the Web by Voice", by C. Hemphill and P. Thrift. In Proceedings of ACM Multimedia, San Francisco, CA, November 7-9, 1995, pp. 215-222.
[Hemphill and Muthusamy, 1997] "Developing Web-based Speech Applications", by C. Hemphill and Y. K. Muthusamy. In Proceedings of Eurospeech '97, Rhodes, Greece, September 1997, Vol. 2, pp. 895-898.
[JSAPI, 1998] "Java Speech Application Programming Interface", Beta Version 0.7. Located at http://java.sun.com/products/java-media/speech, June 1998.
[JSGF, 1998] "Java Speech Grammar Format Specification", Beta Version 0.6. Located at http://java.sun.com/products/java-media/speech/forDevelopers/JSGF/index.html, April 1998.
[JSML, 1998] "Java Speech Markup Language Specification", Beta Version 0.5. Located at http://java.sun.com/products/java-media/speech/forDevelopers/JSML/index.html, August 1998.
[Marx, 1995] "Towards Effective Conversational Messaging", by M. T. Marx, Master's Thesis, MIT, June 1995.
[Raggett and Ben-Natan, 1998] "Voice Browsers", by Dave Raggett and Or Ben-Natan. Located at http://www.w3.org/TR/NOTE-voice, 1998.