W3C NOTE-voice-19980128

Voice Browsers

W3C NOTE 28th January 1998

This version:
Latest Version:
Dave Raggett, W3C/HP
Or Ben-Natan, Microsoft

Status of this document

This document is a NOTE made available by the W3 Consortium for discussion only. This indicates no endorsement of its content, nor that the Consortium has, is, or will be allocating any resources to the issues addressed by the NOTE.

This note describes features needed for effective interaction with Web browsers that are based upon voice input and output. Some extensions are proposed to HTML 4.0 and CSS2 to support voice browsing, and some work is proposed in the area of speech recognition and synthesis to make voice browsers more effective.

The minimal change needed to CSS2 is to allow literal text for the cue-before and cue-after properties. For HTML 4.0, the accesskey attribute needs to be added to the SELECT, OPTGROUP and OPTION elements. Its omission appears to be an oversight in the HTML 4.0 specification itself.

Two new events, OnSelectionTimeout and OnSelectionError, are proposed to improve the ergonomics of the user interface for voice browsers. A number of additional changes are needed for robust speech recognition and high quality speech synthesis.


This note describes features needed for voice browsers. These are browsers that exploit voice input and output, using speech synthesis and prerecorded sound for output (together with small displays when available), and a combination of keyboard and speech recognition for input.

The technology will make it practical to browse the Web from any telephone. W3C has a major role to play in facilitating the development of open standards for voice browsers, which are expected to become available during the next year or so.

Voice browsers use speech synthesis and prerecorded material to present the contents of Web pages. A variety of aural effects can be used to give different emphasis to headings, hypertext links, list items and so on.

Users interact with voice browsers by spoken command or by pressing a button on a keypad. Some commands interrupt the browser, for instance, to request a list of the hypertext links occurring in the current section of the document. Other commands are given when the browser prompts for input, for example, to select an option from a menu in a form.

To increase the robustness of speech recognition, voice browsers take advantage of contextual clues provided by the author. This allows the recognition engine to focus on likely utterances, improving the chances of a correct match. Work is needed to specify how such contextual clues are represented.

Speech synthesis is driven by dictionaries, falling back for unknown words on rules for regular pronunciation. High quality speech synthesis is possible if the author can extend the dictionary resident in the browser.

This should be practical without the author being a qualified phonetician with a deep understanding of terms such as labiodental and alveolo-palatal fricatives. Work is needed to establish a simple means for authors to specify how particular words should be spoken.

Alternate Media

Speech synthesis is not as good as having an actor read the text. Content providers will inevitably want to provide prerecorded content for some parts of Web pages.

Prerecorded content is analogous to the use of images in visual Web pages. The same need for textual fall-backs applies for printing, searching and access by users with disabilities.

Note that prerecorded content is likely to include music and different speakers (think about radio adverts). These effects can be reproduced to some extent via the aural style sheet features in CSS2 [CSS2].


Alternatives to using a mouse click:

For an application using a cellular phone, it would be cumbersome to have to take the handset away from your ear to press a button on the keypad. It therefore makes sense to support voice input for navigation as an alternative to keyboard input.

At its simplest the user could speak the word "follow" when she hears a hypertext link she wishes to follow. The user could also interrupt the browser to request a short list of the relevant links, e.g.

    User: links?

    Browser: The links are:

      1   company info
      2   latest news
      3   placing an order
      4   search for product details

    Please say the number now

    User: 2

    Browser: Retrieving latest news ...

A more sophisticated voice browser would allow the user to say a few words to indicate which link she is interested in, for example: "I want to place an order". The browser could use simple template matching rules to select a matching link, akin to those used in the AI program "Eliza", which mimics a conversation with a therapist.

For this to work, the author is likely to need some control over the speech recognition parameters. This control includes pointers to vocabulary, template rules, definition of sensitivity and more.
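As a rough illustration of selecting a link from a few spoken words, the sketch below scores each link caption by the number of words it shares with the utterance. The function names, the stop-word list and the scoring rule are all assumptions made for this example; the Note does not define a matching algorithm, and a real browser would rely on author-supplied vocabulary and templates.

```python
import re

# Hypothetical stop words that carry no information about the target link.
STOP_WORDS = {"i", "want", "to", "a", "an", "the", "for", "please"}

# Link captions taken from the dialogue example above.
LINKS = [
    "company info",
    "latest news",
    "placing an order",
    "search for product details",
]

def keywords(text):
    """Return the set of lower-cased words in text, minus stop words."""
    words = re.findall(r"[a-z]+", text.lower())
    return {w for w in words if w not in STOP_WORDS}

def match_link(utterance, captions):
    """Return the single caption sharing the most keywords with the utterance.

    Returns None when nothing overlaps or when several captions tie,
    in which case the browser would have to ask the user again.
    """
    spoken = keywords(utterance)
    scores = [(len(spoken & keywords(c)), c) for c in captions]
    best = max((score for score, _ in scores), default=0)
    winners = [c for score, c in scores if score == best]
    if best == 0 or len(winners) != 1:
        return None
    return winners[0]

selected = match_link("I want to place an order", LINKS)
```

A tie or an empty overlap returns None rather than a guess, which is where the error handling proposed later in this Note would take over.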

Another command could be used to request a list of the document's headings. This would allow users to browse an outline form of the document as a means to get to the section that interests them.

Forms and Input Fields

Voice browsers will allow users to move between form fields and to enter and review field values, using either the keyboard or voice input.

Authors must be able to specify what spoken phrases should be used for the selection of links, radio buttons, check boxes, image buttons, submit buttons, and selection lists. (Key access is already provided by the accesskey attribute in HTML 4 [HTML4].)

Handling Errors and Ambiguities

In a voice based browser it is easy for the user to enter unexpected or ambiguous input, or just to pause, providing no input at all. Some examples:

Authors must have control over the browser's response to selection errors and timeouts.

Aural Style Sheets

Authors want control over how the document is rendered. Aural style sheets (part of CSS2 [CSS2]) provide a basis for controlling a range of features, including:

Inserted text

When a hypertext link is spoken by a speech synthesiser, the author may wish to insert text before and after the link's caption, to guide the user's response.

For example:

    <A href="driving.html">Driving instruction</A>

May be offered by the voice browser using the following words:

    For driving instruction, press 1

The example shows how the words "For" and "press 1" were added around the text embedded in the anchor element.

At first glance it looks as if this 'wrapper' text should be left for the voice browser to generate, but on closer examination this approach runs into problems.

For example, how would you offer the following anchor element?

    <A href="LeaveMessage.html">Leave us a message</A>

In the English language you could say

    To leave us a message, press 5

It is a safe assumption that other languages will have even more structures and words that apply to special cases.

The CSS2 draft specification includes the means to provide "generated text" before and after element content.

For example:

    H1:before {content: "Chapter" decimal(chapno);
        display: block}

This relies on the :before and :after pseudo-elements to name the positions, and the content property to provide the text to be inserted. Unfortunately, this mechanism doesn't work with the HTML style attribute.

You are therefore forced into using the STYLE element in the document head:

    <style type="text/css">
       #link1:before {content: "For "}
       #link1:after {content: ", press 1"}
       #link5:before {content: "To "}
       #link5:after {content: ", press 5"}
    </style>
    <A id="link1" href="driving.html">Driving instruction</A>
    <A id="link5" href="LeaveMessage.html">Leave us a message</A>

It would be much more convenient if you could specify the text to insert in a style attribute on the link itself. The cue-before and cue-after properties in CSS2 as part of the aural style sheet proposal seem ideal for this purpose:

    <A style='cue-before: "To"; cue-after: ", press 5"'
         href=LeaveMessage.html>Leave us a message</A>

If you want to autonumber links, include % in the cue text. The % is expanded to "1", "2", "3" and so on, according to the order in which the link appears in the markup. The previous example could be rewritten as:

    <A style='cue-before: "To"; cue-after: ", press %"'
         href=LeaveMessage.html>Leave us a message</A>
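The expansion of % can be sketched in a few lines. This is only an illustration: the function name is hypothetical, and it assumes that numbering starts at 1 and counts every link passed in, in markup order, which is one reading of the rule described above.

```python
def expand_autonumber(cues):
    """Replace '%' in each cue with the link's 1-based position in markup order.

    cues holds one cue string per link, listed in the order the links
    appear in the markup (an assumption of this sketch).
    """
    return [cue.replace("%", str(position))
            for position, cue in enumerate(cues, start=1)]

# Three links that all carry the same cue-after text.
expanded = expand_autonumber([", press %", ", press %", ", press %"])
```

Cues without a % pass through unchanged, so autonumbered and hand-numbered links could in principle be mixed.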

We need to get CSS2 revised to extend cue-before and cue-after to support literal text. They currently can only be used with URLs for auditory icons.

Property name: 'cue-before'
Value: <url> | "quoted string" | none
Initial: none
Applies to: all elements
Inherited: no
Percentage values: N/A


Property name: 'cue-after'
Value: <url> | "quoted string" | none
Initial: none
Applies to: all elements
Inherited: no
Percentage values: N/A

URLs are written in a functional notation url( url ). This avoids any ambiguity with quoted strings.
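To show why the functional notation removes the ambiguity, here is a minimal sketch of how a browser might classify a cue value under the proposed grammar. The function name and the returned tuples are assumptions of this example, and string escapes are deliberately ignored.

```python
import re

def parse_cue(value):
    """Classify a cue value as ('none', None), ('url', target) or ('text', string)."""
    value = value.strip()
    if value == "none":
        return ("none", None)
    # Functional notation: url( target ), with optional quotes around the target.
    url = re.fullmatch(r"url\(\s*(.*?)\s*\)", value)
    if url:
        return ("url", url.group(1).strip("\"'"))
    # Literal text must be a double-quoted string.
    text = re.fullmatch(r'"(.*)"', value)
    if text:
        return ("text", text.group(1))
    raise ValueError(f"not a valid cue value: {value!r}")

icon = parse_cue("url(ding.au)")
spoken = parse_cue('", press 5"')
```

A quoted string that happens to contain url(...) is still treated as literal text, because the quotes are checked only after the functional form has failed to match.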

Access Keys

The HTML 4 accesskey attribute can in principle be used to identify which key to press for a given link, for instance:

    <A accesskey="5"
          style='cue-before: "To"; cue-after: ", press 5"'
          href=LeaveMessage.html>Leave us a message</A>

To ensure that the spoken cue matches the access key, the accesskey attribute supports the same autonumbering mechanism as cue-before and cue-after, for instance:

    <A accesskey="%"
          style='cue-before: "To"; cue-after: ", press %"'
          href=LeaveMessage.html>Leave us a message</A>

The accesskey attribute needs to be added to the SELECT, OPTGROUP and OPTION elements. This appears to be an oversight in the HTML 4.0 specification itself.

What is missing is a media-dependent way to bind keys to particular links, form fields and so on. Whether or not this is an important omission needs to be resolved.

Text to Speech

Text to speech dictionaries contain information on how each word is to be spoken by a speech synthesiser. This covers both phonemes and prosody (stress). The pronunciation may depend on the context in which a word occurs. As a result, limited linguistic analysis may be needed to determine which pronunciation applies.

For instance, in the example below, the word "read" is pronounced as "red" in the first line and as "reed" in the second line:

Standard dictionaries for each language are likely to be incomplete, missing irregular words for personal names, place names, technical terms and abbreviations. For this reason, authors need a way to provide supplementary text to speech information and to indicate when it applies.

Specialized representations for phonemic and prosodic information can be off-putting for non-specialist users. For this reason it is common to see simplified ways to write down pronunciation, for instance, the word "station" can be defined as:

station: stay-shun

This kind of approach is likely to encourage users to add pronunciation information, leading to an increase in the quality of spoken documents, as compared with more complex and harder to learn approaches.
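As a sketch of how such simplified entries might be used, the code below parses lines of the form shown above and consults the author's entries before the browser's resident dictionary. The function names and the lookup order are assumptions of this example; words found in neither dictionary are passed through unchanged, to be handled by the rules for regular pronunciation. Punctuation handling is omitted.

```python
def parse_entries(text):
    """Parse lines of the form 'word: respelling' into a dictionary."""
    entries = {}
    for line in text.splitlines():
        if ":" not in line:
            continue  # skip blank or malformed lines
        word, respelling = line.split(":", 1)
        entries[word.strip().lower()] = respelling.strip()
    return entries

def respell(sentence, author_entries, resident_entries):
    """Replace each known word with its respelling.

    Author entries take priority over the resident dictionary;
    unknown words are returned as written.
    """
    result = []
    for word in sentence.split():
        key = word.lower()
        result.append(author_entries.get(key, resident_entries.get(key, word)))
    return " ".join(result)

author = parse_entries("station: stay-shun")
phrase = respell("the station", author, {})
```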

A language independent representation must cope with the full range of sounds and stress patterns found in the world's languages. A promising starting point is the International Phonetic Alphabet [IPA]. For greater flexibility in representing prosodic information, it may be appropriate to define a markup notation, based upon XML [XML].

There is a strong case for W3C to facilitate the development of a standard way to encode such information, bringing together experts from industry and academia. This will maximise the chances of interoperability, and create a market for speech fonts and speech synthesis software based upon open standards.

Detailed Proposals

Voice Files

One way to provide prerecorded content is to use the OBJECT element to reference the voice file, with the content of the OBJECT element providing the textual fall-back, e.g.

    <object data="advert.au" type="audio/basic">
      Hey there buddy, have you heard of the fantastic
      offers on cruises in the Caribbean this winter?
      Get 50% off now from <a href=horizon>Horizon
      vacations and leave the big freeze behind!</a>
    </object>

The author might want to use an image for graphical browsers. This could be represented as an outer OBJECT element for the image, wrapping the audio object:

    <object data="advert.jpeg" type="image/jpeg">
      <object data="advert.au" type="audio/basic">
        Hey there buddy, have you heard of the fantastic
        offers on cruises in the Caribbean this winter?
        Get 50% off now from <a href=horizon>Horizon
        vacations and leave the big freeze behind!</a>
      </object>
    </object>
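The fall-back behaviour of nested OBJECT elements can be sketched as follows: the browser tries the outermost object first and works inwards until it finds a media type it can render, ending with the text content. The function name and the flat list representation of the nesting are assumptions made for this illustration only.

```python
def choose_rendering(alternatives, fallback_text, supported_types):
    """Pick what to render for a nest of OBJECT elements.

    alternatives lists (media_type, data) pairs from the outermost
    OBJECT to the innermost one. The first supported type wins;
    if none is supported, the textual fall-back is used.
    """
    for media_type, data in alternatives:
        if media_type in supported_types:
            return (media_type, data)
    return ("text", fallback_text)

# The image object wraps the audio object, as in the example above.
ADVERT = [("image/jpeg", "advert.jpeg"), ("audio/basic", "advert.au")]
FALLBACK = "textual fall-back"

# A voice browser without a display supports audio but not images.
voice_choice = choose_rendering(ADVERT, FALLBACK, {"audio/basic"})
```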

The spoken word is generally as important as the written word. This justifies a simple mechanism for speech as opposed to a more general and inevitably more complex mechanism based upon metadata. With this in mind, a particularly simple approach is to add an attribute to HTML elements, that links to a voice file for use in place of the element's content. This attribute could be used on elements such as DIV, TABLE, and OBJECT. For instance:

    <div voicefile="advert.au">
      Hey there buddy, have you heard of the fantastic
      offers on cruises in the Caribbean this winter?
      Get 50% off now from <a href=horiz>Horizon
      vacations and leave the big freeze behind!</a>
    </div>

Speech Recognition Grammar

    Attribute name: grammar
    Value: CDATA
    Applies to: All elements

The "grammar" attribute allows a grammar block to be included with an input tag. The grammar block gives a speech recognition engine the information it needs to analyze the kinds of speech expected at that point. At present, the proposal does not include the format of the block; this will have to be defined in coordination with the speech recognition industry.

An HTML page may include a check box titled "Are you an American citizen?". A voice-based user agent may ask the user this question with the help of a text to speech engine. The expected answers are "Yes" or "No", but the user could equally give any other positive or negative response in their language, such as "Ya", "you betcha", "sure" or "of course". It is therefore necessary to feed the speech recognition engine with the likely utterances representing each desired response.

When the page includes a sequence of hypertext links, a grammar attribute supplied with an enclosing element (e.g. P, UL, LI or DIV) can be used to provide recognition templates. This technique can also be used together with the SELECT element for menus, and for the FIELDSET element for groups of radio buttons and checkboxes etc.

A template is a string with tagged slots that either match any substring or which match a restricted set of substrings. The approach offers much greater flexibility than simple string matching.
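Since the format of the grammar block is not yet defined, the following is purely a sketch of the idea of tagged slots. It assumes a hypothetical syntax in which {name} matches any substring and {name:a|b|c} matches only the listed alternatives, and it translates such a template into a regular expression.

```python
import re

# Hypothetical slot syntax: {name} or {name:choice1|choice2}.
SLOT = re.compile(r"\{([A-Za-z_]\w*)(?::([^}]*))?\}")

def compile_template(template):
    """Translate a template with tagged slots into a compiled regular expression."""
    pattern = ""
    position = 0
    for slot in SLOT.finditer(template):
        # Literal text between slots must match exactly (ignoring case).
        pattern += re.escape(template[position:slot.start()])
        name, choices = slot.group(1), slot.group(2)
        if choices is None:
            body = ".+?"  # unrestricted slot: any non-empty substring
        else:
            body = "|".join(re.escape(c.strip()) for c in choices.split("|"))
        pattern += f"(?P<{name}>{body})"
        position = slot.end()
    pattern += re.escape(template[position:])
    return re.compile(pattern, re.IGNORECASE)

def match_template(template, utterance):
    """Return a dict of slot values if the utterance fits the template, else None."""
    match = compile_template(template).fullmatch(utterance.strip())
    return match.groupdict() if match else None

order = match_template("I want to {action:place|cancel} an order",
                       "I want to place an order")
```

Compared with matching whole strings, one template covers several utterances and also reports which alternative the user chose.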

Error Handling

An error response is an event generated by an element which solicits input by talking to the user and waiting for input. Two types of error response are proposed: one for the situation where no selection is made or no input is entered, and one for the case where a selection is made for something which is not offered.


The browser may generate an OnSelectionTimeout event when the user is asked to provide input of any kind, such as a selection from a list of anchors or a text input box, and fails to do so within a browser dependent time-out (settable via scripts).

For example, the following block may be offered to the user for navigation.

    <P onselectiontimeout='document.speak("You have
        not entered any selection, please enter your
        selection now")'>
        <A href=Instructions.html>Directions</A> |
        <A href=Todo>List of things to do</A>

The OnSelectionTimeout event is processed by the browser according to the browser's own definition of the time-out for input entry or selection of anchor tags. The OnSelectionTimeout event applies to all block tags as well as form elements.


When the user selects an option not offered by the browser, the user must be notified that an error occurred. The notification and the resulting action are to be performed by a script associated with the OnSelectionError event.


    <P onselectionerror='document.speak("The selection
         you have entered is invalid, please enter your
         selection again now")'>
        <A href=Instructions.html>Directions</A> |
        <A href=Todo>List of things to do</A>
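The way a browser might dispatch the two proposed events can be sketched as below. This is an illustration only: the function name, the use of None to signal a time-out, and the case-insensitive comparison are assumptions of the example, and the speak function simply records what a real browser would say aloud.

```python
def handle_selection(options, spoken, on_timeout, on_error):
    """Return the chosen option, or call the matching handler and return None.

    spoken is None when the user gave no input before the time-out expired.
    """
    if spoken is None:
        on_timeout()  # corresponds to the OnSelectionTimeout event
        return None
    choice = spoken.strip().lower()
    for option in options:
        if option.lower() == choice:
            return option
    on_error()  # corresponds to the OnSelectionError event
    return None

OPTIONS = ["Directions", "List of things to do"]
said_by_browser = []

def speak(message):
    """Stand-in for speech output: record the message instead of speaking it."""
    said_by_browser.append(message)

def timeout_handler():
    speak("You have not entered any selection")

def error_handler():
    speak("The selection you have entered is invalid")

first = handle_selection(OPTIONS, None, timeout_handler, error_handler)
second = handle_selection(OPTIONS, "weather", timeout_handler, error_handler)
third = handle_selection(OPTIONS, "directions", timeout_handler, error_handler)
```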

Next Steps

The format for the grammar block and for text to speech information needs to be defined in consultation with experts from the speech synthesis and recognition industry and centers of academic research in this area. Perhaps the best way to move forward is to discuss the issues presented in this Note in the W3C workshop on Mobile Devices scheduled for April in Tokyo.

The workshop could be followed by setting up one or more working groups. Another possibility would be to issue a briefing package and a call for participation in working groups prior to holding the workshop.

In the meantime, it may be practical for the proposals for inserted text, voice files and error responses to be reviewed in the existing W3C groups, in particular, the Web Accessibility Initiative, the CSS & FP working group and the HTML coordination group.

References

[CSS1] "Cascading Style Sheets, level 1", Håkon Wium Lie and Bert Bos, December 1996.
Available at http://www.w3.org/TR/REC-CSS1-961217.html
[CSS2] "Cascading Style Sheets, level 2", Håkon Wium Lie, Bert Bos and Ian Jacobs.
Available at http://www.w3.org/TR/WD-CSS2/
[HTML4] "Hypertext Markup Language (HTML) Version 4.0", Dave Raggett, Arnaud Le Hors and Ian Jacobs, December 1997.
Available at http://www.w3.org/TR/REC-html40/
[IPA] The International Phonetic Alphabet is defined by the International Phonetic Association, see: http://www.arts.gla.ac.uk/IPA/ipa.html
[XML] The Extensible Markup Language (XML). Information about XML is available at http://www.w3.org/XML/

Copyright © 1997 W3C (MIT, INRIA, Keio), All Rights Reserved. W3C liability, trademark, document use and software licensing rules apply.