Position Paper: Towards Improving Audio Web Browsing
Michael Wynblatt, Stuart Goose
KEYWORDS: WWW, hypertext, audio, browsing, telephone.
Driver Information Systems (DIS) are emerging as a growth area for commercial applications. Our focus with LIAISON has been on providing drivers with access to existing information sources by developing an eyes-free and, for the most part, hands-free WWW browser. The idea is to make listening to the web analogous to listening to the radio, so that drivers can make more productive use of their commuting time. The radio analogy is important because it stresses minimal interaction and a passive system; in an automobile, safety is the first concern, and flexibility at best a third or fourth priority. Clearly, one of the central challenges here is to provide a simple navigation framework that demands minimal interaction.
Our DICE telephony browser is primarily intended to supplement regular desktop WWW access by providing ubiquitous access through the world's most pervasive device. Since a "keyboard" of sorts is available, and hands-free operation is less critical, DICE can place more emphasis than LIAISON on providing a rich feature set. As a result, we are exploring issues in user-driven navigation, such as skimming. Because the poor quality of telephone audio tends to compound any inadequacies in the speech-synthesis engine, the coherence of the rendering itself is also at a premium here.
Both DICE and LIAISON rely on the underlying WIRE technology to produce audio renderings of HTML documents. WIRE neither acts as a screen reader for rendered HTML nor extracts the text content of a document while ignoring its structure. Like visual browsers such as Netscape Navigator and Internet Explorer, WIRE aims to interpret the content and structure of an HTML document and provide a faithful rendition. WIRE primarily uses speaker changes and announcements (sign-posting) to convey meta-information to the user, and makes sparing use of earcons. We have found that the alternative, broad use of earcons and sound effects, tends to distract listeners from the content. Our experience confirms the user-study results of James: "[an] interface using many sounds was rated the lowest for both general appropriateness and general likability". One aspect of our ongoing work is to investigate the use of audio spatialization for conveying content, structural cues and environmental conditions.
Since our interest is mainly in existing content, rather than specialty audio content, our goal has been to provide faithful renditions for a broad range of existing web-sites. The bulk of our research effort has gone into generating heuristics for interpreting highly formatted HTML documents, which inevitably make liberal use of constructs such as tables and frames. WIRE converts these structures into a meaningful audio context in which the user can browse. This differs from other efforts, which focus on dissuading authors from using such formatting tags or on promulgating audio-friendly specification standards.
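To give the flavor of such heuristics, the following sketch flags dense clusters of consecutive hyperlinks, which on highly formatted pages usually indicate a navigation region rather than content. The threshold, class name, and the rule itself are illustrative only, not the actual WIRE heuristics:

```python
from html.parser import HTMLParser

class LinkClusterDetector(HTMLParser):
    """Illustrative structural heuristic: count runs of consecutive
    hyperlinks with little text between them."""

    def __init__(self, min_links=4):
        super().__init__()
        self.min_links = min_links
        self._run = 0        # hyperlinks in the current run
        self.clusters = 0    # runs long enough to count as an index

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._run += 1

    def handle_data(self, data):
        # A substantial block of text ends the current run of links.
        if len(data.strip()) > 40:
            self._flush()

    def close(self):
        super().close()
        self._flush()

    def _flush(self):
        if self._run >= self.min_links:
            self.clusters += 1
        self._run = 0

page = ('<a href="/a">Home</a> <a href="/b">News</a> '
        '<a href="/c">Sport</a> <a href="/d">Contact</a>'
        '<p>' + 'A long paragraph of article text follows here. ' * 3
        + '</p>')
detector = LinkClusterDetector()
detector.feed(page)
detector.close()
print(detector.clusters)  # prints 1: the four-link run is one cluster
```

A rendering system could then announce such a cluster as a unit and offer to skip it, rather than reading each link in turn.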
The Aural Style Sheet W3C CSS2 proposal provides a good foundation. Audio volume, spatialization, pausing, and speaker customization are fundamental requirements for an audio style specification. Many of the difficulties in preparing an audio rendering lie in the ambiguities of the written language and in how an automated system should interpret HTML constructs. In the remainder of this section we suggest further ideas to reduce such ambiguities.
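As a sketch, the property names below are taken from the W3C aural style sheet proposal, while the selectors and values are purely illustrative:

```css
/* Illustrative aural rules; property names are from the W3C
   ACSS/CSS2 proposal, selectors and values are examples only. */
h1 {
  voice-family: paul;       /* distinct speaker signals a heading */
  volume: loud;
  pause-before: 1s;         /* silence marks a section boundary */
}
a {
  azimuth: 30deg;           /* spatialize links to the listener's right */
  cue-before: url("link.wav");  /* brief earcon before each hyperlink */
}
```

Such rules would let authors express exactly the speaker changes and sign-posting that WIRE currently has to infer.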
One example is navigation indexes. Many WWW documents contain regions consisting of dense clusters of hyperlinks, often to other documents on the same web-site. Users of visual browsers recognize such areas immediately by their appearance, and direct their attention to them only when needed. Users of audio browsers, however, can recognize such areas only implicitly, from the fact that several links occur in a row, or explicitly, if their browser can recognize the navigation area and announce it; such announcements act as a signpost that helps the user understand their current position. The first option can be extremely tedious if the user must listen to 15 hyperlinks before reaching the main content of the document, especially since many sites repeat the same indexes at the top of every document. The second option is preferable, especially when coupled with an option to skip over the area, and it would be simpler and less error-prone if authors had a method for delineating navigation indexes.
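For illustration, such a delineation mechanism might look like the following; the navindex element is hypothetical and not part of any HTML standard:

```html
<!-- Hypothetical markup delineating a navigation index.  An audio
     browser could announce "navigation index, four links; press
     skip to continue" and jump past it on request. -->
<navindex title="Site navigation">
  <a href="/home">Home</a>
  <a href="/products">Products</a>
  <a href="/support">Support</a>
  <a href="/contact">Contact</a>
</navindex>
```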
Another useful signpost is for sections. The HTML header tag is an appropriate way to mark the beginning of a new section. Unfortunately, many HTML documents contain sub-regions without titles, and thus without header tags. These sections are implicit to the visual user, who understands them from visual context (see Figure 1). Audio users, however, need prompting from the browser, or they risk losing the flow of the document. Again, it is possible to detect such regions automatically from the structural elements that create the visual context. It would be much simpler, though, to have such sections explicitly indicated by a tag.
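A hypothetical explicit section tag, sketched below, would let even an untitled sub-region be announced cleanly; neither the element nor its attribute exists in current HTML:

```html
<!-- Hypothetical explicit section markup.  An audio browser could
     announce the title attribute even though the region has no
     visible heading, and hence no header tag. -->
<section title="Today's headlines">
  ... a visually distinct but untitled sub-region ...
</section>
```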
Caption: Figure 1. Few of the visually explicit sections of this WWW document have titles, which means they are unlikely to have header tags. (Red lines added for emphasis.)
Evidence suggests that humans can process multiple simultaneous conversations and that this ability can be exploited for more efficient browsing. This argues for supporting multiple spatialized speech-synthesis channels in addition to static and streamed sounds. Extensions to HTML/XML could contain constructs specifying how sections of the document map to the positions, or trajectories, of the multiple speech-synthesis channels. One potential problem with supporting multiple simultaneous speech-synthesis streams is that two or more of them may demand user input, perhaps specified in a language such as VoxML. The browser must therefore support selecting a specific stream for input.
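As a purely hypothetical illustration, such constructs might bind document regions to spatialized channels; none of the elements or attributes below exist in HTML or in any current proposal:

```xml
<!-- Hypothetical extension binding document regions to spatialized
     speech-synthesis channels with positions or trajectories. -->
<speechchannel id="main" azimuth="0deg"/>
<speechchannel id="summary" azimuth="-60deg"
               trajectory="-60deg to -90deg over 30s"/>

<div channel="main"> ... primary article text ... </div>
<div channel="summary"> ... related-story summaries ... </div>
```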
While not wholly tailored to audio browsers, the SMIL specification offers some useful constructs for indicating how media streams are cued for sequential playback and how multiple streams are played in parallel. Extensions to HTML/XML could adapt the general SMIL constructs to the specific requirements of audio browsers. If the browser uses an underlying technology such as Microsoft DirectSound or Intel RSX for spatialization and mixing, in conjunction with extended SMIL constructs for audio scheduling, the result could be a flexible and dynamic system.
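The seq and par constructs of SMIL 1.0 already express the kind of scheduling an audio browser needs; in the fragment below the file names are placeholders:

```xml
<!-- SMIL 1.0 scheduling: the two clips inside <par> play
     simultaneously, and the whole group follows the intro clip. -->
<smil>
  <body>
    <seq>
      <audio src="intro.wav"/>
      <par>
        <audio src="narration.wav"/>
        <audio src="ambience.wav"/>
      </par>
    </seq>
  </body>
</smil>
```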
Aural Cascading Style Sheets (ACSS), W3C Note. http://www.w3.org/Style/css/Speech/NOTE-ACSS
Goose, S., Wynblatt, M. and Mollenhauer, H., 1-800-Hypertext: Browsing Hypertext With A Telephone, Proceedings of the ACM International Conference on Hypertext, Pittsburgh, USA, pages 287-288, June 1998. http://www.scr.siemens.com/pdf/goose.pdf
Intel RSX. http://developer.intel.com/ial/rsx/
James, F., Lessons from Developing Audio HTML Interfaces, Assets 98: Proceedings of the Conference on Assistive Technologies, Marina Del Ray, USA, April 1998.
Microsoft DirectSound. http://www.microsoft.com/directx/pavilion/dsound/default.asp
Motorola, VoxML: The Mark-up Language for Voice. http://www.voxml.com
Schmandt, C. and Mullins, A., AudioStreamer: Exploiting Simultaneity for Listening, Proceedings of the ACM International Conference on Computer Human Interfaces (CHI), 1995. http://www.acm.org/sigchi/chi95/Electronic/documnts/shortppr/cs_bdy.htm
Synchronized Multimedia Integration Language (SMIL) 1.0 Specification. http://www.w3.org/TR/REC-smil/
Wynblatt, M., Benson, D., and Hsu, A., Browsing the World Wide Web in a Non-Visual Environment, Proceedings of the International Conference on Auditory Display (ICAD), Palo Alto, USA, pages 135-138, November 1997. http://www.santafe.edu/~kramer/icad/websiteV2.0/Conferences/ICAD97/Wynblatt.pdf