Media Technologies Lab
Texas Instruments Incorporated
P.O. Box 655303, MS 8374, Dallas, TX 75265
[rajeev | yeshwant | vishu]@csc.ti.com
There is a vast amount of information on the World Wide Web, at the fingertips of anyone with Internet access. However, so far this information has primarily been used by people who connect to the web via a traditional computer. This is about to change. Recent advances in wireless communication, speech recognition, and speech synthesis technologies have made it possible to access this information from any place, at any time, using only a cellular phone. Possible applications include browsing the web, getting stock quotes, verifying flight schedules, obtaining maps and directions, and checking E-mail. In this paper, we discuss different types of web-based applications, briefly describe our system architecture with examples of applications we have developed, and discuss some of the key issues in building spoken dialog applications for the web.
In this paper, we outline different categories of applications for voice browsers and present suggestions that should be helpful in setting standards for voice browsers. We also give an overview of our work in this area and discuss some design issues that are important in building portable, usable, and robust spoken interfaces for web-based applications.
Web Browsing: Such applications let the user go to any page on the web and browse it by speaking the links on that page rather than clicking on them. The grammars for speech recognition can be created dynamically by parsing each web page as it is accessed. Such applications may or may not provide audio feedback to the user.
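The dynamic grammar creation step can be sketched as follows: extract the anchor text of each link on the page and emit a JSGF-style rule whose alternatives are those link names. This is a minimal illustration only; the regular expression is a simplification of real HTML parsing, and a production voice browser would reuse its own parse tree.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch: build a JSGF-style grammar of spoken link names from an HTML page.
public class LinkGrammarBuilder {
    // Simplified anchor pattern, for illustration only.
    private static final Pattern LINK =
        Pattern.compile("<a\\s+[^>]*href=\"([^\"]*)\"[^>]*>([^<]+)</a>",
                        Pattern.CASE_INSENSITIVE);

    // Collect the visible text of each link on the page.
    public static List<String> extractLinkNames(String html) {
        List<String> names = new ArrayList<>();
        Matcher m = LINK.matcher(html);
        while (m.find()) {
            names.add(m.group(2).trim());
        }
        return names;
    }

    // Emit one public rule whose alternatives are the link names.
    public static String toJsgf(List<String> names) {
        StringBuilder sb =
            new StringBuilder("#JSGF V1.0;\ngrammar links;\npublic <link> = ");
        for (int i = 0; i < names.size(); i++) {
            if (i > 0) sb.append(" | ");
            sb.append(names.get(i));
        }
        return sb.append(";").toString();
    }

    public static void main(String[] args) {
        String page = "<html><a href=\"/news\">Latest News</a>"
                    + "<a href=\"/quotes\">Stock Quotes</a></html>";
        System.out.println(toJsgf(extractLinkNames(page)));
        // #JSGF V1.0;
        // grammar links;
        // public <link> = Latest News | Stock Quotes;
    }
}
```

The resulting grammar string can then be handed to the recognizer each time a new page is loaded.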
Limited Information Access: This category includes applications that can provide useful information in limited domains. Good examples of such applications are querying about the weather in a city, checking stock quotes, etc. The user may ask for the required information in a few different ways, as allowed by the recognition grammar. The result is spoken back to the user in a format fixed by the text generation rules for that scenario. Such applications may be implemented in a couple of ways:
Using Embedded Grammars: It is possible to embed the grammars for the speech recognizer, as well as the tags used for audio output, in the HTML of certain "smart" pages created specifically for the application. Note that the HTML may either embed the grammars and synthesis rules directly or contain pointers to the files containing them.
Using Java Applets: One may also write speech-enabled Java applets to perform the same task. The Java Speech API [JSAPI, 1998] specifies the interface between Java applets and speech recognition and synthesis engines. The speech recognition grammars and the text generation rules are incorporated within the applet.
Spoken Dialog Systems: The web-based spoken dialog systems are usually designed using a client-server architecture. The web page could be running a Java applet that acts as a client, which connects to a remote server that does the bulk of the work. In such systems, the grammars for the speech recognizer are loaded from the server. Further, the server controls the client's interaction with the user, thereby obviating the need for any embedded tags for either the recognition grammars or text generation rules. Examples of such systems include remote access to E-mail/Voice mail, finding certain businesses near the user's location, getting maps/directions for any place, etc.
The limited information access applications take voice browsers one step further by allowing the user to provide input for speech-based form-filling applications. Once the user has specified the values of certain fields in a form, the information from that page may be used to query either a CGI script or a Java servlet to access the desired information. Although this is a very useful feature, there is a limited set of tasks that one can accomplish with the small, fixed grammars of these systems. For most "real" applications, it is important to provide the user with a natural interface, and this requires a spoken dialog system. We believe this category offers the largest number of possible information access applications. We have done considerable work both in building Java applet-based limited information access systems and in building spoken dialog systems.
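The query construction step might look like the following sketch, where the CGI script name and the field names are hypothetical, chosen only to illustrate how recognized form values are URL-encoded into a request:

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch: turn filled form fields into a CGI query string.
// The script path and parameter names are illustrative assumptions.
public class FormQuery {
    public static String buildQuery(String script, Map<String, String> fields) {
        StringBuilder sb = new StringBuilder(script).append("?");
        boolean first = true;
        for (Map.Entry<String, String> e : fields.entrySet()) {
            try {
                if (!first) sb.append("&");
                sb.append(URLEncoder.encode(e.getKey(), "UTF-8"))
                  .append("=")
                  .append(URLEncoder.encode(e.getValue(), "UTF-8"));
            } catch (UnsupportedEncodingException ex) {
                throw new IllegalStateException(ex); // UTF-8 is always available
            }
            first = false;
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("city", "San Francisco");
        fields.put("day", "tomorrow");
        System.out.println(buildQuery("/cgi-bin/weather", fields));
        // /cgi-bin/weather?city=San+Francisco&day=tomorrow
    }
}
```

The same string could just as easily be posted to a Java servlet; only the endpoint changes.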
In the next section, we present some suggestions for modifications/extensions to HTML4 or CSS2 that may help in setting standards for voice browsers. Some of these suggestions may already be well understood by some readers, but we include them here for the sake of completeness. In Section 4, we provide an overview of our work in developing several prototype spoken dialog systems. Finally, in Section 5, we discuss additional design issues that are important in building such systems.
For error handling, Raggett and Ben-Natan [Raggett and Ben-Natan, 1998] suggest two possible handlers, OnSelectionTimeout and OnSelectionError. We suggest the inclusion of two additional error handlers, OnRejectionError and OnHelp, which are described as follows:
OnRejectionError: Since what the user is allowed to say at any time is limited by the state of the system, it is possible that some user utterances may be unexpected for the system and are likely to get misrecognized. Many recognizers (including ours) incorporate score-based rejection, so that if the system does not have enough confidence in its recognition, it asks the user to repeat the utterance rather than trying to process whatever it recognized. The OnRejectionError handler will specify what should be said to the user in case the recognizer rejects his/her utterance.
OnHelp: The user may sometimes get confused about what he/she is supposed to say to the system. It would be nice to embed a help message within the HTML so that whenever the user asks for help (by saying anything in a standard help grammar, such as "help", or "what can I say", etc.), the voice browser can provide it. Note that this feature would be more useful for smaller applications; in larger spoken dialog systems, it becomes essential to provide context-sensitive help which is often provided by the server that understands the state of the dialog.
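The score-based rejection behind OnRejectionError amounts to a confidence-threshold test. The following is a minimal sketch of that control flow; the threshold value and prompt text are illustrative assumptions, not part of any specification or of our recognizer's actual interface:

```java
// Sketch: the confidence of the best hypothesis is compared against a
// threshold; below it, the OnRejectionError prompt is spoken instead of
// acting on the hypothesis. Threshold and prompt are illustrative.
public class RejectionFilter {
    private final double threshold;
    private final String rejectionPrompt;

    public RejectionFilter(double threshold, String rejectionPrompt) {
        this.threshold = threshold;
        this.rejectionPrompt = rejectionPrompt;
    }

    // Returns the hypothesis to act on, or the rejection prompt to play back.
    public String respond(String hypothesis, double confidence) {
        if (confidence < threshold) {
            return rejectionPrompt;
        }
        return hypothesis;
    }

    public static void main(String[] args) {
        RejectionFilter f =
            new RejectionFilter(0.5, "Sorry, I didn't catch that. Please repeat.");
        System.out.println(f.respond("check stock quotes", 0.92));
        // check stock quotes
        System.out.println(f.respond("mumbled input", 0.31));
        // Sorry, I didn't catch that. Please repeat.
    }
}
```

In a browser, the rejected branch would invoke whatever prompt the page author supplied in the OnRejectionError handler.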
There appears to be no standard way to specify speech grammars at present. Although Cascading Style Sheets can be used to specify the feedback string to be sent to the speech synthesizer, there is no equivalent format for speech input. None of the recognized media descriptors lends itself nicely to speech recognition. We propose the addition of a "speech" media descriptor that would enable the specification of the speech recognition grammar using a STYLE statement such as:
<STYLE type="text/grammar" media="speech">
The formats for the specification of these speech grammars have not yet been standardized. We propose that a subset of the Java Speech Grammar Format (JSGF) [JSGF, 1998] be used for this purpose; it is a good standard that has already been adopted by speech researchers.
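As an illustration of the proposed mechanism, a page could embed a small weather-query grammar inside such a STYLE element. The JSGF fragment and the city list below are hypothetical examples, not part of any standard:

```html
<STYLE type="text/grammar" media="speech">
  #JSGF V1.0;
  grammar weather;
  public <query> = [tell me] the weather (in | for) <city>;
  <city> = Dallas | Houston | New York;
</STYLE>
```

A voice browser encountering this element would load the grammar into its recognizer for the duration of the page, exactly as it loads a style sheet for rendering.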
The specification of text generation rules for any application also needs to be standardized. One example is the use of "cue-before" and "cue-after" tags, as outlined in [Raggett and Ben-Natan, 1998]. The generated text can be annotated with additional markers to assist the TTS in properly pronouncing it. The annotation format may be adapted from the Java Speech Markup Language (JSML) [JSML, 1998]. JSML is based on XML and has a standardized, extensible syntax that is not tied to the Java language. Further, any JSML document can easily be converted to a well-formed XML document by adding a single declaration line at the top.
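For instance, retrieved text could be annotated before being sent to the TTS as follows. The element names are taken from the JSML beta draft; the exact attribute names and values shown here are illustrative assumptions, and the declaration on the first line is the "one line" that makes the document well-formed XML:

```xml
<?xml version="1.0"?>
<JSML>
  Your flight <SAYAS class="digits">1372</SAYAS> departs at
  <SAYAS class="time">8:35</SAYAS> from gate <EMP>twelve</EMP>.
  <BREAK/>
  <PROS rate="-20%">Please check in one hour early.</PROS>
</JSML>
```

Here the SAYAS elements tell the synthesizer how to read numbers, EMP adds emphasis, and PROS slows the prosody for the final instruction.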
We suggest that speech recognizers integrated within a voice browser should have the ability to dynamically load new grammars at runtime and to switch between grammars that have already been loaded. In the web browsing application, this allows the recognizer to go to any previously unseen page on the web and dynamically generate and load the grammars for all links on that page. In the example described in [Raggett and Ben-Natan, 1998], if the user's previous query resulted in four options, the user can be prompted to speak any of the options rather than prompting "Say 1 for Company Info, 2 for Latest News, 3 for Placing an Order,...". Another example would be that of an E-mail system, where the names of all the senders of E-mail are dynamically loaded after E-mail is checked, so as to enable the user to then ask for E-mail from specific people. Although it is certainly possible to accomplish all this with a large vocabulary speech recognizer as well, the dynamic grammar capability is useful in vastly improving the recognition accuracy.
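The control flow for dynamic grammar loading and switching can be sketched as follows. A real recognizer would compile each grammar; here a grammar is represented only by its JSGF text, which is enough to illustrate the idea, and the grammar names and sender names are hypothetical:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch: a registry of named grammars that can be loaded at runtime and
// switched between. Grammar compilation is recognizer-specific and omitted.
public class GrammarRegistry {
    private final Map<String, String> grammars = new HashMap<>();
    private String active;

    public void load(String name, String jsgfText) {
        grammars.put(name, jsgfText);
    }

    // E.g., after checking mail, build a sender grammar from the inbox so the
    // user can then ask for mail from specific people.
    public void loadSenders(List<String> senders) {
        load("senders", "#JSGF V1.0;\ngrammar senders;\npublic <sender> = "
                        + String.join(" | ", senders) + ";");
    }

    public void activate(String name) {
        if (!grammars.containsKey(name)) {
            throw new IllegalArgumentException("grammar not loaded: " + name);
        }
        active = name;
    }

    public String activeGrammar() {
        return grammars.get(active);
    }

    public static void main(String[] args) {
        GrammarRegistry r = new GrammarRegistry();
        r.loadSenders(java.util.Arrays.asList("John Smith", "Mary Jones"));
        r.activate("senders");
        System.out.println(r.activeGrammar());
    }
}
```

In the four-option example above, the browser would load a grammar built from the option labels and activate it in place of a "say 1, 2, 3" prompt.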
Some of the sample applications we have developed are described next. These include getting flight, weather, and stock information from an InfoPhone, remote access to E-mail/voice mail, voice navigation to obtain maps and directions to different locations, and a business locator service to find businesses near the user's current location. The InfoPhone application belongs to the second category (limited information access), while the others fall under the category of spoken dialog systems.
InfoPhone: Although the mobile user may have access, via radio broadcasts, to useful information such as stock quotes, flight schedules, and weather forecasts, this information is (i) not customized to each user, and (ii) only available when he is in the car or otherwise has access to a radio. Our InfoPhone prototype system attempts to solve this problem by providing easy access to information sources on the web from the user's cellular phone. The InfoPhone is currently a speech-enabled Java applet that simulates a cellular phone. Users choose flights, stocks, or weather from a top-level menu and interact with each "service" by speech commands. "Keypad" (non-speech) input is also available as a backup. The applet incorporates separate grammars for company names (for stocks), flight numbers, and city names (for weather). We envision these grammars being customized for each user when he signs up for the system. The dynamic grammar switching capability of our speech recognizer allows the user to switch between these grammars on the fly. Speech input to the applet is processed by the recognizer, and the information request is sent to a server that accesses the appropriate web site, retrieves the HTML page, extracts just the essential information, and transmits it to the applet. The results are played out by the TTS system and displayed on the small "phone display", so the user is not forced to look at the display for the information. We expect the InfoPhone system to be a valuable information retrieval tool for the mobile user.
Remote E-mail/Voice Mail Access: The Voice E-mail system has a client-server architecture and is completely voice-driven. Users talk to the system and listen to messages and prompts played back by the speech synthesizer. The system has a minimal display (for status messages) and is designed to operate primarily in a "displayless" mode, where the user can effectively interact with the system without looking at a display. For people who prefer a display in non-driving conditions, an optional display can be incorporated into this system. The server handles all of the e-mail/voice-mail functions. It accesses the e-mail and voice-mail servers and handles the receiving, sending, and storage of messages. It communicates with the client via sockets. The client provides the user interface and handles the reading, navigation, categorization, and filtering of e-mail and voice-mail messages. It has both speech recognition and TTS capabilities and does not maintain a constant connection to the server (to reduce connection time charges). It connects to the server only to initiate or end a session, check for new mail, or send a message. An important aspect of the displayless user interface is that the user should, at all times, know exactly what to do, or should be able to find out easily. To this end, we have incorporated an elaborate context-dependent help feature. If the user gets lost, he also has the ability to reset all changes and start over from the beginning. The current system is an extension of previous collaborative work with MIT [Marx, 1995] and handles reading, filtering, categorization, and navigation of e-mail messages. It will soon incorporate voice-mail send and receive (using Caller ID information) and, later, the capability to "compose" and send e-mail (for example, using speech-based form-filling).
Voice Navigation: We have developed an application to obtain maps and/or directions for different places in a city as naturally as possible, by voice I/O alone. This navigation system is primarily aimed at hands-busy, eyes-busy conditions such as automobile driving. An optional display is provided for situations where the user may safely look at the screen, for example when the car is parked. All textual feedback is spoken back to the user using speech synthesis. A dialog manager handles all interactions with the user. The GPS receiver, the speech recognizer, and the speech synthesizer reside on the navigation client. The navigation server interacts with the MapQuest web site (www.mapquest.com), which acts as the map server. The user may ask for maps and/or directions to specific addresses in a city or to certain points of interest in that city. Further, any displayed map may be zoomed to different levels or moved in any direction using voice commands. As an example, the user may say "give me directions to 8330 LBJ Freeway" or "how do I get to the Doubletree Hotel?" and get directions to these places. If the system knows about more than one hotel with the same name, the dialog manager will first ask the user to resolve that ambiguity before attempting to retrieve the map.
Business Locator: It is often the case that mobile users do not want to go to a specific address or point of interest, but just want to find some business near their current location. We have developed a system that incorporates about 30 different yellow pages categories (such as "Chinese restaurant", "gas station", "towing service", "locksmith", and "convenience store"). Users may make queries like "Find the nearest Chinese restaurant" and get the name, address, and phone number for it. Further, this service has been integrated with the voice navigation application so that once nearby businesses are found, the user may also get maps and/or directions to them, if needed. We use the GTE SuperPages web site (www.superpages.com) as our yellow pages server.
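The "nearest" selection can be sketched as a great-circle distance comparison against the user's GPS position. The business names and coordinates below are hypothetical; this is an illustration of the computation, not our system's actual implementation:

```java
// Sketch: choose the nearest business by great-circle (haversine) distance
// from the user's GPS position. All data below is made up for illustration.
public class BusinessLocator {
    static final double EARTH_RADIUS_KM = 6371.0;

    // Haversine great-circle distance between two lat/lon points, in km.
    static double distanceKm(double lat1, double lon1, double lat2, double lon2) {
        double dLat = Math.toRadians(lat2 - lat1);
        double dLon = Math.toRadians(lon2 - lon1);
        double a = Math.sin(dLat / 2) * Math.sin(dLat / 2)
                 + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
                 * Math.sin(dLon / 2) * Math.sin(dLon / 2);
        return 2 * EARTH_RADIUS_KM * Math.asin(Math.sqrt(a));
    }

    // names[i] is located at (lats[i], lons[i]); return the nearest name.
    static String nearest(double userLat, double userLon,
                          String[] names, double[] lats, double[] lons) {
        int best = 0;
        for (int i = 1; i < names.length; i++) {
            if (distanceKm(userLat, userLon, lats[i], lons[i])
                < distanceKm(userLat, userLon, lats[best], lons[best])) {
                best = i;
            }
        }
        return names[best];
    }

    public static void main(String[] args) {
        String[] names = { "Gas Station A", "Gas Station B" };
        double[] lats = { 32.90, 32.78 };
        double[] lons = { -96.77, -96.80 };
        System.out.println(nearest(32.79, -96.80, names, lats, lons));
        // Gas Station B
    }
}
```

In the deployed service the candidate list comes back from the yellow pages server, so only the final comparison happens on the client.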
In Section 3, we described some of the issues related to the proper integration of a speech recognizer and a speech synthesizer into a browser. We now present additional design issues related to the development of better speech applications for the web. We have already incorporated some of these issues into our application design, as discussed below.
We believe that there will be an explosion of speech applications for the web in the next few years. It is important that some standards be set now so that all voice browsers can provide at least some features in a consistent manner. We have also discussed additional design issues which, although not directly related to specification of HTML4/CSS2 standards, are nevertheless important to consider when building such applications.
[Hemphill and Thrift, 1995] "Surfing the Web by Voice", by C. Hemphill and P. Thrift. In Proceedings of ACM Multimedia, San Francisco, CA, November 7-9, 1995, pp. 215-222.
[Hemphill and Muthusamy, 1997] "Developing Web-based Speech Applications", by C. Hemphill and Y. K. Muthusamy. In Proceedings of Eurospeech '97, Rhodes, Greece, September 1997, Vol. 2, pp. 895-898.
[JSAPI, 1998] "Java Speech Application Programming Interface", Beta Version 0.7. Located at http://java.sun.com/products/java-media/speech, June 1998.
[JSGF, 1998] "Java Speech Grammar Format Specification", Beta Version 0.6. Located at http://java.sun.com/products/java-media/speech/forDevelopers/JSGF/index.html, April 1998.
[JSML, 1998] "Java Speech Markup Language Specification", Beta Version 0.5. Located at http://java.sun.com/products/java-media/speech/forDevelopers/JSML/index.html, August 1998.
[Marx, 1995] "Towards Effective Conversational Messaging", by M. T. Marx, Master's Thesis, MIT, June 1995.
[Raggett and Ben-Natan, 1998] "Voice Browsers", by Dave Raggett and Or Ben-Natan. Located at http://www.w3.org/TR/NOTE-voice, 1998.