Considerations in Producing a Commercial Voice Browser

Michael B. Robin
Charles T. Hemphill

Conversational Computing
8522 154th Avenue NE
Redmond, Washington 98052
mikero@conversa.com
hemphill@conversa.com

Abstract

Conversational Computing has produced a voice browser that works in conjunction with a standard HTML browser. We describe some possible uses for a voice browser and some of the features incorporated into this browser to facilitate voice interaction. Toward the goal of voice enabling content on the Web, we offer some examples of how page design and HTML extensions might enhance the voice browser experience.

Why a Voice Browser?
Features of Conversa Web™
Voice Friendly HTML
References

Why a Voice Browser?

Browsers for HTML were primarily designed with the mouse and keyboard in mind. In many potential applications of browsing, however, use of the hands during browsing might prove inconvenient or impossible. Voice input is a natural solution for such hands-busy situations. Even in standard browser applications, using voice input is simply more fun than the alternatives.

Conversational Computing has produced a voice browser called Conversa Web [CC98]. This browser replaces the mouse in most instances to enable hands-free browsing [HT95]. Using voice input to select links, for example, is often less tedious than using the mouse. Voice input provides direct "see and say" access to links, eliminating the wrist strain associated with holding the mouse, often for hours at a time. Devices with inconvenient mouse access include some notebook computers, PDAs, cell phones with displays, and set-top boxes.

There is a growing trend to use voice browsers for various applications. Many body-worn computers incorporate voice recognition for inspection, repair, and maintenance. Other applications include information kiosks and voice-driven presentations [HM97]. Furthermore, many applications being built today use the browser interface as the GUI.

Features of Conversa Web™

Conversa Web voice-enables the components of Internet Explorer to create a voice browser. The primary features include voice enabling links, favorites, and browser navigation commands.

To voice enable links, Conversa Web includes all likely ways of speaking the link. For example, to speak the link 1998 CUI Introduction, the user might say "nineteen-ninety eight C U I Introduction". Because Conversa Web uses text-to-speech rules for unknown words, it also allows the user to say "coo-ee" for CUI. As the user speaks a link, Conversa Web triggers as soon as it hears enough to distinguish it from other choices. As a result, users need not speak the entire link in the case of long links. After the user speaks a link, Conversa Web briefly highlights the link to give feedback to the user regarding the proper selection. Once Conversa Web asks the Web for the page, it begins to play music to provide aural feedback in case of Web delays during page retrieval.
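The early-trigger behavior described above can be illustrated with a minimal sketch. It assumes the recognizer has already turned speech into words, and it matches those words against link labels by prefix. The function names and sample links are hypothetical, and this is not Conversa Web's actual implementation.

```python
def normalize(text):
    """Lowercase and split a link label or utterance into words."""
    return text.lower().split()


def match_link(spoken_words, link_labels):
    """Return the single label whose words start with spoken_words, else None.

    None means either that no label matches or that more speech is
    needed to tell the remaining candidates apart.
    """
    spoken = normalize(spoken_words)
    if not spoken:
        return None
    candidates = [
        label for label in link_labels
        if normalize(label)[:len(spoken)] == spoken
    ]
    return candidates[0] if len(candidates) == 1 else None


# Hypothetical links on a page.
links = [
    "1998 CUI Introduction",
    "1998 CUI Schedule",
    "Voice Friendly HTML",
]
print(match_link("voice", links))              # -> Voice Friendly HTML
print(match_link("1998 cui", links))           # -> None (still ambiguous)
print(match_link("1998 cui schedule", links))  # -> 1998 CUI Schedule
```

The sketch works on recognized text only. Alternative pronunciations such as "nineteen-ninety eight" or "coo-ee" would need an additional expansion step before matching.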

Conversa Web uses Saycons™ for links associated with images. The term "Saycon" is a shortened form of "sayable icon". For each image associated with a link, Conversa Web superimposes a cartoon dialog bubble that contains a number. Users simply say the number shown in the Saycon to activate an image-based link. This mechanism solves the problem of bad alt tags -- we have found that fewer than 20% of alt tags accurately reflect what a user might expect to say based on the associated image. Saycons are also used to disambiguate identical textual links associated with distinct URLs.
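One way to picture the Saycon assignment is the sketch below. It assumes each link is described by a small record with hypothetical fields ("url", "text", "is_image"), and it numbers every image link plus every text link whose label is shared by more than one URL. The numbering order and data layout are assumptions for illustration, not a description of the product's internals.

```python
from collections import defaultdict


def assign_saycons(links):
    """Number every link that cannot be selected reliably by its text.

    Each link is a dict with "url", "text" (may be empty) and "is_image".
    Returns a dict mapping the spoken number to the link's URL.
    """
    # Group the target URLs of text links by their (case-folded) label.
    urls_by_text = defaultdict(set)
    for link in links:
        if not link["is_image"]:
            urls_by_text[link["text"].lower()].add(link["url"])

    saycons = {}
    for link in links:
        ambiguous = (not link["is_image"]
                     and len(urls_by_text[link["text"].lower()]) > 1)
        if link["is_image"] or ambiguous:
            saycons[len(saycons) + 1] = link["url"]
    return saycons


# Hypothetical page: one image link, two identical "here" links, one clear link.
page_links = [
    {"url": "/home", "text": "", "is_image": True},
    {"url": "/local", "text": "here", "is_image": False},
    {"url": "/national", "text": "here", "is_image": False},
    {"url": "/about", "text": "About us", "is_image": False},
]
print(assign_saycons(page_links))  # -> {1: '/home', 2: '/local', 3: '/national'}
```

The "About us" link receives no number because its text alone identifies it.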

While surfing, users may add pages to the favorites list simply by saying "add to favorites". Conversa Web uses the title of the page to create an entry in the favorites list. At any time after that, users may select from a list of favorite items by voice.

Voice commands for voice surfing include the standard ones such as "page up" and "go back", but they also include commands to make the browser "go to sleep" and "wake up". Voice commands for scrolling allow users to move the page to the desired position or to read while the text scrolls automatically on the screen. Conversa Web also includes voice-activated help via the command "Conversa Help Me".

Voice Friendly HTML

Conversa Web enables voice browsing in standard HTML Web pages. However, improvements in three distinct areas might improve the voice browsing experience: conventions for Web page authors, Web page authoring tools, and the HTML language itself.

Most Web page authors create pages without considering that users might surf them by voice. Using some simple conventions, authors can often enhance the voice browsing experience without changing the experience for traditional surfing. Some examples include:

Use text descriptions in addition to or instead of URLs. Descriptions are normally easier to speak than URLs.
- Before: This version: http://www.w3.org/WAI/UA/WD-WAI-USERAGENT-19980814
- After: This version (http://www.w3.org/WAI/UA/WD-WAI-USERAGENT-19980814).
Use text descriptions in addition to or instead of E-mail addresses.
- Before: Contact Mike Robin at Conversational Computing, mikero@conversa.com.
- After: Contact Mike Robin at Conversational Computing (mikero@conversa.com).
Use more than one word in a link when appropriate (e.g., a noun phrase). This decreases ambiguity and increases accuracy.
- Before: the WAI UA charter discusses goals.
- After: the WAI UA charter discusses goals.
Avoid ambiguous links - they require Saycons or other special treatment.
- Before: Click here for local news, or here for national news.
- After: Select local news or national news.
Avoid the term "click" in a page - the term "select" provides a less mouse-centric alternative (see previous example).
With image anchors, always use alt tags that accurately reflect the text in the image.
Never use images in anchors when it is either unclear that they are anchors or they do not clearly reflect what the user might say to select them.
Never use server side image maps.

We emphasize that the current set of features in Conversa Web allows the user to select a link by voice in all but the last of these examples, but following the guidelines will often make voice input more natural to the user.

While much of the responsibility for creating a voice friendly Web page lies with the author, the authoring tool can also help. An authoring tool should:

Suggest alternative ways of specifying URLs and E-mail addresses when it detects them in links.
Automatically detect ambiguous links and bring them to the author's attention.
Automatically include alt tag information from text used to create graphics.
Issue warnings about the use of server side image maps and other speech unfriendly elements.
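Several of these checks can be sketched with Python's standard html.parser module. The class name VoiceLint, the warning wording, and the sample page are assumptions made for illustration. The sketch flags URLs or e-mail addresses used as link text, image links without alt text, server side image maps (the ismap attribute), and identical link text that points to different targets. It does not attempt to generate alt text from graphics.

```python
import re
from html.parser import HTMLParser

URL_OR_EMAIL = re.compile(r"^(https?://\S+|\S+@\S+\.\S+)$", re.IGNORECASE)


class VoiceLint(HTMLParser):
    """Collect warnings about links that are awkward to select by voice."""

    def __init__(self):
        super().__init__()
        self.warnings = []
        self.links = []      # (text, href) for every finished text link
        self._href = None    # href of the anchor being parsed, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and "href" in attrs:
            self._href, self._text = attrs["href"], []
        elif tag == "img":
            if "ismap" in attrs:
                self.warnings.append("server-side image map")
            if self._href is not None and not attrs.get("alt"):
                self.warnings.append(
                    f"image link without alt text: {self._href}")

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            text = " ".join("".join(self._text).split())
            if text:
                self.links.append((text, self._href))
            if URL_OR_EMAIL.match(text):
                self.warnings.append(
                    f"URL or e-mail address used as link text: {text}")
            self._href = None

    def report(self):
        """Return all warnings, including text shared by different targets."""
        targets = {}
        for text, href in self.links:
            targets.setdefault(text.lower(), set()).add(href)
        ambiguous = [f"ambiguous link text: {text}"
                     for text, hrefs in sorted(targets.items())
                     if len(hrefs) > 1]
        return self.warnings + ambiguous


# Hypothetical page built from the "before" examples above.
page = """
<p>This version: <a href="http://www.w3.org/WAI/UA/WD-WAI-USERAGENT-19980814">
http://www.w3.org/WAI/UA/WD-WAI-USERAGENT-19980814</a></p>
<p>Click <a href="/local">here</a> for local news,
or <a href="/national">here</a> for national news.</p>
<a href="/map"><img src="map.gif" ismap></a>
"""
lint = VoiceLint()
lint.feed(page)
lint.close()
for warning in lint.report():
    print(warning)
```

On this sample the sketch reports four warnings: the URL used as link text, the server side image map, the image link without alt text, and the ambiguous "here" links.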

Advanced HTML authoring tools might even detect similar sounding words and offer alternatives to improve recognition accuracy.

Finally, certain extensions to the HTML language can facilitate voice input. As an example, selection lists are specifically designed for operation with the mouse. It ought to be possible to easily ask for the options in a selection list by voice, but HTML does not support this. We can accommodate the selection with Saycons, but a selection phrase directly associated with the selection list might offer a better alternative. The same discussion applies to form field values and other HTML elements.

Because of past misuse or disuse of tags such as the alt tag, a new tag ought to indicate that the page has been designed with speech input in mind. The new tag might indicate, for example, that the voice browser could trust the alt tags and thereby include them in the active vocabulary set.
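To make the idea concrete, the sketch below shows how a voice browser might treat such a marker. No such tag exists in HTML. The meta element named "speech-input" with the value "designed" is purely a hypothetical stand-in for the proposed tag, and the class and function names are likewise assumptions. Alt text is added to the active vocabulary only when the page carries the marker.

```python
from html.parser import HTMLParser


class VocabularyBuilder(HTMLParser):
    """Gather phrases a user could speak to select links on a page."""

    def __init__(self):
        super().__init__()
        self.speech_ready = False  # set by the hypothetical page-level marker
        self.link_text = []
        self.alt_text = []
        self._in_link = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # Hypothetical marker: <meta name="speech-input" content="designed">
        if tag == "meta" and attrs.get("name") == "speech-input":
            self.speech_ready = attrs.get("content") == "designed"
        elif tag == "a" and "href" in attrs:
            self._in_link = True
        elif tag == "img" and self._in_link and attrs.get("alt"):
            self.alt_text.append(attrs["alt"])

    def handle_endtag(self, tag):
        if tag == "a":
            self._in_link = False

    def handle_data(self, data):
        if self._in_link and data.strip():
            self.link_text.append(data.strip())

    def vocabulary(self):
        """Alt text joins the vocabulary only when the page opts in."""
        return self.link_text + (self.alt_text if self.speech_ready else [])


def build_vocabulary(html):
    builder = VocabularyBuilder()
    builder.feed(html)
    builder.close()
    return builder.vocabulary()


body = ('<a href="/news">Local news</a> '
        '<a href="/home"><img src="h.gif" alt="Home"></a>')
marker = '<meta name="speech-input" content="designed">'
print(build_vocabulary(body))           # -> ['Local news']
print(build_vocabulary(marker + body))  # -> ['Local news', 'Home']
```

Without the marker, the image link would still need a Saycon or similar treatment.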

References

[CC98] Conversational Computing home page (http://www.conversa.com).

[HT95] "Surfing the Web by Voice", by C. Hemphill and P. Thrift. In Proceedings of ACM Multimedia, San Francisco, CA, November 7-9, 1995, pp. 215-222.

[HM97] "Developing Web-based Speech Applications", by C. Hemphill and Y. Muthusamy. In Proceedings of Eurospeech '97, Rhodes, Greece, September 1997, Vol. 2, pp. 895-898.