Position Paper: Towards Improving Audio Web Browsing

Michael Wynblatt, Stuart Goose
Multimedia and Video Technology Group, Siemens Corporate Research, Inc., Princeton, NJ, USA {wynblatt, sgoose}@scr.siemens.com

KEYWORDS: WWW, hypertext, audio, browsing, telephone.

1. OUR EXPERIENCES TO DATE

At Siemens Corporate Research, we have been designing audio HTML browsers since mid-1996. We have focused our efforts on two applications: an automobile-based browser called LIAISON and a telephone-based browser called DICE. Both of these systems rely upon our underlying WIRE (Web-based Interactive Radio Environment) technology for audio rendering of Internet content.

Driver Information Systems (DIS) are emerging as a growing commercial application area. Our focus with LIAISON has been on providing drivers with access to existing information sources by developing an eyes-free and, for the most part, hands-free WWW browser. The idea has been to make listening to the web analogous to listening to the radio, so that drivers can make more productive use of their commuting time. The radio analogy is important, since it stresses minimal interaction and a passive system; in an automobile, safety is the first concern, and flexibility is a third or fourth priority. Clearly, one of the central challenges here is to provide a simple navigation framework that demands minimal interaction.

Our DICE telephony browser [2] is primarily intended to supplement regular desktop WWW access by providing ubiquitous access through the world's most pervasive device. Since a "keyboard" of sorts is available, and hands-free browsing is not as critical, DICE can place more emphasis than LIAISON on providing a rich feature set. As a result, we are exploring issues in user-driven navigation, such as skimming. As the poor quality of telephone audio tends to compound any inadequacies in the speech synthesis engine, the coherence of the rendering itself is also at a premium here.

Both DICE and LIAISON rely on the underlying WIRE [9] technology to produce audio renderings of HTML documents. WIRE does not act as a screen reader for rendered HTML, nor does it extract the text content of a document and ignore the structure. Like visual browsers, such as Netscape Navigator and Internet Explorer, the goal of WIRE is to interpret the content and structure of an HTML document and provide a faithful rendition. WIRE primarily uses speaker changes and announcements (sign-posting) to convey meta-information to the user, and makes sparing use of earcons. We have found that the alternative, broad use of earcons and sound effects, tends to distract listeners from the content. Our experience confirms the user-study results by James, "[an] interface using many sounds was rated the lowest for both general appropriateness and general likability" [4]. One aspect of our ongoing work is to investigate the use of audio spatialization for conveying content, structural cues and environmental conditions.
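As an illustration of speaker changes and sign-posting, the following Python sketch converts a small HTML fragment into a list of (voice, text) utterances. It is not the WIRE implementation: the voice names, the announcement wording, and the choice of which tags trigger an announcement are all assumptions made for the example.

```python
from html.parser import HTMLParser

# Hypothetical voice names; a real system would map these to synthesizer voices.
BODY_VOICE = "narrator"
LINK_VOICE = "link-speaker"
ANNOUNCER_VOICE = "announcer"


class AudioOutline(HTMLParser):
    """Turn HTML into a list of (voice, text) utterances."""

    HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()
        self.utterances = []
        self._voice_stack = [BODY_VOICE]

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADINGS:
            # Sign-post: announce the structure before reading the content.
            self.utterances.append((ANNOUNCER_VOICE, "Section heading."))
        elif tag == "a":
            # Speaker change: hyperlink text is read by a different voice.
            self._voice_stack.append(LINK_VOICE)

    def handle_endtag(self, tag):
        if tag == "a" and len(self._voice_stack) > 1:
            self._voice_stack.pop()

    def handle_data(self, data):
        text = " ".join(data.split())
        if text:
            self.utterances.append((self._voice_stack[-1], text))


def render(html):
    """Parse an HTML string and return its utterance list."""
    parser = AudioOutline()
    parser.feed(html)
    parser.close()
    return parser.utterances
```

Each utterance would then be handed to a speech synthesizer configured with the named voice; the sketch stops short of that step so that it does not depend on any particular engine.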

Since our interest is mainly in existing content, rather than specialty audio content, our goal has been to provide faithful renditions for a large coverage of existing web-sites. The bulk of our research effort has been to generate heuristics for interpreting highly formatted HTML documents that inevitably contain the liberal use of constructs such as tables and frames. WIRE converts these structures into a meaningful audio context in which the user can browse. This differs from other efforts which focus on dissuading authors from using such formatting tags, or the promulgation of audio-friendly specification standards.

2. CUSTOM SPECIFICATIONS FOR AUDIO CONTENT

Our audio browsing efforts have focused on producing faithful renderings of existing content. We believe that it is not practical to hope that the majority of the millions of WWW authors will redesign their content to conform to audio-friendly standards. It seems clear, though, that if a useful audio WWW terminal were in wide use, as we hope our products will be, a large amount of audio-specific WWW content and many audio-specific applications would be introduced. To this end, we are happy to participate in the discussion of how best to specify audio-oriented content. Our work has given us an appreciation for what is difficult in automating audio document rendering, and what could be improved during the authoring process to provide additional aid to the audio browser.

The Aural Style Sheet W3C CSS2 proposal [1] provides a good foundation. Audio volume, spatialization, pausing, and speaker customization are fundamental requirements for an audio style specification. Many of the difficulties in preparing an audio rendering lie with the ambiguities of written language and with the automated system's interpretation of HTML constructs. In the remainder of this section we suggest further ideas to reduce such ambiguities.

Language: It seems simple enough, but it is a non-trivial task to analyze and recognize whether a particular text is in English, German, French or another language. This is a critical problem because speech synthesis engines are, by nature, language-specific. Reading a German text passage with an English speech synthesis engine produces completely unintelligible results. Although automatically determining the language for a long, or even medium-length, text passage is feasible, it becomes nearly impossible for one or two word digressions. A multi-language dictionary look-up for every word in a document can unfavorably affect rendering time, and does not guarantee a result. A simple solution would be to add a tag to indicate words which are in a different language than the main language of the document.
Phonetics: Many proper names are butchered by speech synthesis engines. A phonetic tag describing a pronunciation-friendly value would help significantly with this problem.
Signposts: One of the main tasks of the WIRE rendering engine is to guess the intent of the author by analyzing the HTML document. Generally an author who conforms to good specification techniques by using the more abstract HTML tags (e.g., <h2> vs <b>) can make this process smoother. However, there are certain important informational cues which are so obvious in visual representations that nobody thought to support them through explicit HTML constructs.
Applets/Scripts: The use of active content, through scripting and applets, is becoming more pervasive. Unless the author is sufficiently diligent, by creating "NOSCRIPT" alternatives and descriptions, it is very difficult to provide any support for this construct.
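The language suggestion above can be sketched as follows. The example assumes that foreign-language passages are marked with a lang attribute on an enclosing element; the text above does not name a concrete tag, so this markup and the default language code are assumptions. The sketch splits a document into (language, text) runs, each of which could be sent to the matching language-specific synthesis engine.

```python
from html.parser import HTMLParser


class LanguageSegmenter(HTMLParser):
    """Split a document into (language, text) runs using lang attributes."""

    def __init__(self, default_lang="en"):
        super().__init__()
        self.segments = []
        # Stack of (tag, language); the first entry is the document default.
        self._stack = [(None, default_lang)]

    def handle_starttag(self, tag, attrs):
        # An element inherits its parent's language unless it sets its own.
        lang = dict(attrs).get("lang") or self._stack[-1][1]
        self._stack.append((tag, lang))

    def handle_endtag(self, tag):
        # Pop back to the matching open element, tolerating unclosed tags.
        for i in range(len(self._stack) - 1, 0, -1):
            if self._stack[i][0] == tag:
                del self._stack[i:]
                break

    def handle_data(self, data):
        text = " ".join(data.split())
        if not text:
            return
        lang = self._stack[-1][1]
        if self.segments and self.segments[-1][0] == lang:
            # Merge adjacent text in the same language into one run.
            self.segments[-1] = (lang, self.segments[-1][1] + " " + text)
        else:
            self.segments.append((lang, text))


def segment(html, default_lang="en"):
    """Return the list of (language, text) runs for an HTML string."""
    parser = LanguageSegmenter(default_lang)
    parser.feed(html)
    parser.close()
    return parser.segments
```

With explicit markup of this kind, even a one- or two-word digression is routed correctly without any dictionary look-up, which is the case that automatic detection handles worst.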

One example of a signpost that HTML does not support explicitly is the navigation index. Many WWW documents contain regions consisting of dense clusters of hyperlinks, often to other documents on the same web-site. Users of visual browsers can immediately recognize such areas by their appearance, and only direct their attention to these areas when needed. Users of audio browsers, however, can only recognize such areas implicitly from the fact that there are several links in a row, or explicitly if their browser can recognize the navigation area and announce it. Such announcements act as a signpost that helps the user understand their current position. The first option can be extremely tedious if the user must listen to 15 hyperlinks before the main content of the document is reached, especially since many sites have the same indexes at the top of every document. The second option is preferable, especially when coupled with an option to skip over the area, and it would be made simpler and less error-prone if the author had a method for delineating navigation indexes.
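In the absence of author markup, one plausible heuristic is to look for runs of hyperlinks separated by little or no text. The Python sketch below is only an illustration of that idea, not the heuristic used by WIRE; the thresholds min_links and max_separator_chars are assumed values chosen for the example.

```python
from html.parser import HTMLParser


class LinkTokenizer(HTMLParser):
    """Flatten HTML into ("link", text) and ("text", text) tokens."""

    def __init__(self):
        super().__init__()
        self.tokens = []
        self._link_text = None  # list of text pieces while inside a link

    def handle_starttag(self, tag, attrs):
        if tag == "a" and dict(attrs).get("href"):
            self._link_text = []

    def handle_endtag(self, tag):
        if tag == "a" and self._link_text is not None:
            self.tokens.append(("link", " ".join(self._link_text)))
            self._link_text = None

    def handle_data(self, data):
        text = " ".join(data.split())
        if not text:
            return
        if self._link_text is not None:
            self._link_text.append(text)
        else:
            self.tokens.append(("text", text))


def tokenize(html):
    parser = LinkTokenizer()
    parser.feed(html)
    parser.close()
    return parser.tokens


def find_navigation_indexes(tokens, min_links=4, max_separator_chars=3):
    """Return (start, end, link_count) for each dense run of links.

    start and end are token indexes, with end exclusive.
    """
    runs = []
    start = None      # index of the first link in the current run
    last_link = None  # index of the most recent link in the current run
    count = 0
    for i, (kind, text) in enumerate(tokens):
        if kind == "link":
            if start is None:
                start, count = i, 0
            count += 1
            last_link = i
        elif start is not None and len(text) > max_separator_chars:
            # Real prose ends the run; short separators such as "|" do not.
            if count >= min_links:
                runs.append((start, last_link + 1, count))
            start = None
    if start is not None and count >= min_links:
        runs.append((start, last_link + 1, count))
    return runs
```

A browser could announce each detected run once (for instance, "navigation index, four links") and offer to skip to the token at position end, rather than reading every link aloud.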

Another useful signpost is for sections. The HTML header tag acts as an appropriate way to mark the beginning of a new section. Unfortunately many HTML documents contain many sub-regions without titles and thus without header tags. These sections are implicit to the visual user, who can understand them from visual context (see figure 1). Audio users, however, need prompting from the browser, or run the risk of losing the flow of the document. Again, it is possible to provide automated detection of such regions based on the structural elements that lead to the visual context. It would be much simpler, though, to have such sections explicitly indicated by a tag.

Figure 1. Few of the visually explicit sections of this WWW document have titles, which means that they are unlikely to have header tags. (Red lines added for emphasis.)

3. RICHER SUPPORT FOR AUDIO MEDIA

As alluded to previously, one avenue of our research is the application of spatialized audio for conveying document content, structural cues and environmental conditions. For these features it is clear that support for the 3D mixing of multiple simultaneous sound sources is required within the client browser. It should be possible for such sound sources to be static files downloaded in advance, RealAudio streams downloaded on demand, or synthesized audio such as MIDI.
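To make the mixing requirement concrete, the sketch below places several mono sources at different horizontal positions and mixes them into a stereo pair using a constant-power pan law. This is a deliberately simplified two-channel stand-in for true 3D spatialization, which would also involve elevation, distance and head-related filtering; the function names and the azimuth convention are assumptions made for the example.

```python
import math


def pan_gains(azimuth_deg):
    """Constant-power stereo gains for an azimuth in [-90, 90] degrees.

    -90 is hard left, 0 is centre, +90 is hard right.
    """
    if not -90.0 <= azimuth_deg <= 90.0:
        raise ValueError("azimuth must be within [-90, 90] degrees")
    angle = math.radians((azimuth_deg + 90.0) / 2.0)  # map to [0, 90] degrees
    return math.cos(angle), math.sin(angle)


def mix(sources):
    """Mix (samples, azimuth_deg) mono sources into (left, right) lists."""
    length = max((len(samples) for samples, _ in sources), default=0)
    left = [0.0] * length
    right = [0.0] * length
    for samples, azimuth in sources:
        gain_left, gain_right = pan_gains(azimuth)
        for i, x in enumerate(samples):
            left[i] += gain_left * x
            right[i] += gain_right * x
    return left, right
```

Because the squared gains always sum to one, a source keeps roughly the same perceived loudness as it moves between the left and right positions.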

Evidence suggests that humans can process multiple simultaneous conversations and that this ability can be exploited for more efficient browsing [7]. This provides an argument for the support of multiple spatialized speech synthesis channels in addition to both the static and streamed sounds. Extensions to HTML/XML could contain constructs for specifying which sections of the document relate to the positions, or trajectories, of the multiple speech synthesis channels. One potential problem with supporting multiple simultaneous speech synthesis streams is that two or more of these streams may demand user input, perhaps specified by a language such as VoxML [6]. The ability to select a specific stream for input must be supported by the browser.

While not wholly appropriate for the application of audio browsers, the SMIL specification [8] offers some useful constructs for indicating how media streams are cued for sequential playing and how multiple streams are played in parallel. Extensions to HTML/XML could extend the general SMIL constructs to make them more specific to the requirements of audio browsers. If the browser uses an underlying technology, such as Microsoft DirectSound [5] or Intel RSX [3], for performing spatialization and mixing, in conjunction with extended SMIL constructs for audio scheduling, the result could be a flexible and dynamic system.
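The sequential and parallel constructs mentioned above can be illustrated with a small scheduling sketch. It models a presentation as a nested tree of "seq" and "par" groups whose leaves are clips with known durations, and computes when each clip should start and stop. The tuple representation and the clip names are assumptions made for the example; the sketch does not parse SMIL and ignores the specification's other timing attributes. A parallel group is taken to end when its longest child ends.

```python
def schedule(node, start=0.0):
    """Return (end_time, events) for a seq/par tree starting at `start`.

    A node is either ("clip", name, duration) or ("seq" | "par", [children]).
    Each event is a (start_time, end_time, name) tuple.
    """
    kind = node[0]
    if kind == "clip":
        _, name, duration = node
        return start + duration, [(start, start + duration, name)]
    if kind not in ("seq", "par"):
        raise ValueError("unknown node kind: %r" % (kind,))
    events = []
    end = start
    cursor = start  # where the next child of a "seq" group begins
    for child in node[1]:
        child_start = cursor if kind == "seq" else start
        child_end, child_events = schedule(child, child_start)
        events.extend(child_events)
        end = max(end, child_end)
        if kind == "seq":
            cursor = child_end
    return end, events
```

For example, a headline followed by a story read over background music, and then a sign-off, could be written as a "seq" group containing a clip, a "par" group of two clips, and a final clip; the resulting event list could then drive the mixing layer.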

REFERENCES

[1] Aural Cascading Style Sheets (ACSS). W3C Note, http://www.w3.org/Style/css/Speech/NOTE-ACSS.

[2] Goose, S., Wynblatt, M. and Mollenhauer, H., 1-800-Hypertext: Browsing Hypertext With A Telephone, Proceedings of the ACM International Conference on Hypertext, Pittsburgh, USA, pages 287-288, June 1998. http://www.scr.siemens.com/pdf/goose.pdf

[3] Intel RSX, http://developer.intel.com/ial/rsx/

[4] James, F., Lessons from Developing Audio HTML Interfaces, Assets 98: Proceedings of the Conference on Assistive Technologies, Marina del Rey, USA, April 1998.

[5] Microsoft DirectSound, http://www.microsoft.com/directx/pavilion/dsound/default.asp

[6] Motorola, VoxML: The Mark-up Language for Voice, http://www.voxml.com

[7] Schmandt, C. and Mullins, A., AudioStreamer: Exploiting Simultaneity for Listening, Proceedings of the ACM Conference on Human Factors in Computing Systems (CHI), 1995. http://www.acm.org/sigchi/chi95/Electronic/documnts/shortppr/cs_bdy.htm

[8] Synchronized Multimedia Integration Language (SMIL) 1.0 Specification, http://www.w3.org/TR/REC-smil/

[9] Wynblatt, M., Benson, D., and Hsu, A., Browsing the World Wide Web in a Non-Visual Environment, Proceedings of the International Conference on Auditory Display (ICAD), Palo Alto, USA, pages 135-138, November 1997. http://www.santafe.edu/~kramer/icad/websiteV2.0/Conferences/ICAD97/Wynblatt.pdf