Source: PipeBeach AB
PipeBeach is developing a voice browser for enterprise servers. It provides interactive access to web pages over conventional and cellular telephone lines. The product supports both DTMF and speech input from the user, as well as speech synthesis and digital audio output.
We believe that the major obstacle to wide-scale commercial deployment of voice browsers for the web is not the technology, but the ease (or difficulty!) with which web page designers can add speech support to their site.
From this perspective, it would be desirable for the voice browser to render interactive speech dialogs from standard HTML web pages. Our experience has shown that it is indeed possible.
However, since HTML has primarily been developed for visual rendering (and it is conventionally used in this way), there will be elements which are not amenable to speech rendering (e.g. image maps), as well as web design practices which make speech rendering more awkward than visual rendering (e.g. the use of frames). Furthermore, there are many aspects of speech rendering over which the designer has no control, since there are no corresponding HTML tags: for example, selecting a specific (type of) recognizer or synthesizer, associating a specific recognition grammar with input elements, controlling the synthesizer's volume, speed, and pitch, and specifying interaction handlers for timeouts, errors, and so on.
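To make the gap concrete, consider how such control might be expressed if HTML were extended. The following sketch is purely illustrative and hypothetical; the `grammar`, `timeout`, and `onnomatch` attributes are not part of any HTML standard, but show the kind of speech-specific information a designer currently has no way to attach to a form element:

```html
<!-- Hypothetical, illustrative attributes only; NOT part of HTML.
     "grammar" would bind a recognition grammar to the input,
     "timeout" and "onnomatch" would configure interaction handling. -->
<form action="/book-flight">
  <input type="text" name="city"
         grammar="cities.gram"
         timeout="5s"
         onnomatch="Sorry, I did not catch the city name.">
</form>
```

Nothing like this exists in HTML today, which is precisely why speech-specific behaviour cannot be authored in standard pages.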
One way of dealing with these problems is to deploy a specific markup language for speech rendering. This has been the strategy behind many initiatives emerging from telco companies; for example, WML (Wireless Markup Language) for small devices from the WAP consortium, and, it seems, VoxML for voice browsers.
Creating new XML-based languages is certainly the way to go when it comes to expanding the capabilities of the web into industry-specific areas. However, adding voice capabilities is a fundamental improvement NOT tied to any specific industry: potentially all web pages and all browsers will have voice capabilities. PipeBeach's position is therefore that this standard should be administered by W3C as part of the "natural" evolution of the current web standards, possibly supported by new voice-specific working groups.
With this approach it is less likely that web designers will have to develop different pages specialized for different types of browsers, a situation that would delay voice browsing from going mainstream. As an example, we propose that HTML be extended to provide optional browser-specific tags, so that a web page designer can optionally add elements which exploit the power of voice browsers. The CSS2 aural style sheet is a good move in this direction, but there should be a single open W3C standard which also addresses speech recognition, as well as interaction issues, for voice browsing.
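The CSS2 aural style sheet already illustrates the optional-extension approach for the output side. A minimal sketch, using properties actually defined in the CSS2 specification, shows how a designer can tune synthesis without affecting visual browsers, which simply ignore these properties:

```css
/* Aural properties from the CSS2 specification; a visual
   browser ignores them, a voice browser uses them to control
   the synthesizer. */
h1 { voice-family: male; speech-rate: slow; pitch: low; }
a  { cue-before: url("ping.wav"); volume: loud; }
em { pitch-range: 60; stress: 60; }
```

What CSS2 does not cover is the input side: speech recognition, grammars, and dialog interaction, which is why a broader W3C standard is needed.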
In summary, our position is: