This document is [will be] a W3C Working Draft for review by W3C members and other interested parties. It is a draft document and may be updated, replaced or obsoleted by other documents at any time. It is inappropriate to use W3C Working Drafts as reference material or to cite them as other than "work in progress". A list of current W3C working drafts can be found at: http://www.w3.org/pub/WWW/TR/
Note: since working drafts are subject to frequent change, you are advised to reference the above URL, rather than the URLs for working drafts themselves.
Those of us who are sighted are accustomed to visual presentation of HTML documents, frequently on a bitmapped display. This is not the only possible method, however.
Aural presentation, using a combination of speech synthesis and 'audio icons', provides an alternative presentation. This form of presentation is in current use by the print-impaired community. Often such aural presentation occurs by converting the document to plain text and feeding this to a screen reader. This results in less effective presentation than would be the case if the document structure were retained.
There are other large markets for aural presentation, including in-car and home entertainment use; aural or mixed aural/visual presentation is likely to increase in importance over the next few years. Realizing that the aural rendering is essentially independent of the visual rendering:
Given this current and future importance, it makes sense to influence presentation based on the structure of the document. Cascading Style Sheets may be used for this purpose. Using style sheets rather than HTML tag extensions allows the same document to be read with visual, aural, or multimodal presentation without cluttering up the document or having to produce three (or more) separate parallel documents, which has been shown to result in update problems. This approach provides greatly improved document accessibility for visually disabled people without requiring compromises in the visual design of the document.
This document draws very heavily on the initial properties proposed by T.V. Raman. It also attempts to address the issues raised on the www-style mailing list that were altered in the later modification, but does not address them in the same way.
This document builds upon the CSS1 specification.
This section extends the CSS1 specification to allow additional types of value.
Style sheets influence the presentation of documents by assigning values to style properties. This section lists the defined style properties for aural presentation, and their corresponding list of possible values, expressed using CSS1 syntax.
The list of CSS1 properties has been kept to a minimum, while making sure commonly used styles can be expressed. Depending on the formatting model and the presentation medium, some properties can prove hard to incorporate into existing UA implementations. According to the conformance rules, UAs should make efforts to format documents according to the style sheets, but full support for all properties cannot be expected. E.g., a monaural browser is not able to fully honor spatial audio, but should approximate by mixing all the channels together.
In the text below, the allowed values for each property are listed with a syntax like the following:
Value: N | NW | NE
Value: [ <length> | thick | thin ]{1,4}
Value: <uri>? <color> [ / <color> ]?
Value: <uri> || <color>
The words between < and > give a type of value. The most common types are <length>, <percentage>, <url>, <number> and <color>; these are described in the section on units. The more specialized types (e.g. <font-family> and <border-style>) are described under the property where they appear.
Other words are keywords that must appear literally, without quotes. The slash (/) and the comma (,) must also appear literally.
Several things juxtaposed mean that all of them must occur, in the given order. A bar (|) separates alternatives: one of them must occur. A double bar (A || B) means that either A or B or both must occur, in any order. Brackets ([]) are for grouping. Juxtaposition is stronger than the double bar, and the double bar is stronger than the bar. Thus "a b | c || d e" is equivalent to "[ a b ] | [ c || [ d e ]]".
Every type, keyword, or bracketed group may be followed by one of the following modifiers:
This specification introduces two new units in addition to the units of CSS1.
These are the legal angle units:
These are the legal time units:
Spatial audio is an important stylistic property for aural presentation. It provides a natural way to tell several voices apart, much as we do in real life (people rarely all stand in the same spot of a room). Stereo speakers produce a lateral soundstage. Binaural headphones or the increasingly popular 5-speaker home theatre setups can generate full surround sound, and multi-speaker setups can create a true three-dimensional soundstage. VRML 2.0 also includes spatial audio, which implies that spatial audio hardware will become more widely available.
The value is given in the range 0deg <= x < 360deg where 0deg is interpreted as directly ahead in the center of the sound stage. 90deg is to the right, 180deg behind and 270deg to the left. It may also be specified using keywords:
keyword | value | value with 'behind' |
---|---|---|
side-left | 270deg | 270deg |
far-left | 300deg | 240deg |
left | 320deg | 220deg |
center-left | 340deg | 200deg |
center | 0deg | 180deg |
center-right | 20deg | 160deg |
right | 40deg | 140deg |
far-right | 60deg | 120deg |
side-right | 90deg | 90deg |
This property is most likely to be implemented by mixing the same signal into different channels at differing volumes. It might also use phase shifting, digital delay, and other such techniques to provide the illusion of a soundstage. The precise means used to achieve this effect and the number of speakers used to do so are browser dependent; this property merely identifies the desired end result. Examples:
H1 { azimuth: 30deg }
TD.a { azimuth: far-right } /* 60deg */
P#12 { azimuth: behind far-right } /* 120deg */
P.comment { azimuth: behind } /* 180deg */
Do we need relative values like "more-to-the-right" ?
UAs should attempt to honor this request if they have resources to do so. If 'azimuth' is specified and the output device cannot produce sounds behind the listening position, values in the rearward hemisphere should be converted into forward hemisphere values. One method is as follows:
The value is given in degrees in the range -90deg to 90deg. 0deg is interpreted as on the forward horizon, which loosely means level with the listener. 90deg is directly overhead and -90deg is directly underneath. The precise means used to achieve this effect and the number of speakers used to do so are undefined. This property merely identifies the desired end result. UAs should attempt to honor this request if they have resources to do so. Examples:
H1 { elevation: above }
TR.a { elevation: 60deg }
TR.b { elevation: 30deg }
TR.c { elevation: level }
What about relative values: "higher", "lower" ??
Value: <percentage> | x-soft | soft | medium | loud | x-loud
Initial: medium
Applies to: all elements
Inherited: yes
Percentage values: relative to user-specified mapping
There is a fixed mapping between keyword values and percentages:
Volume refers to the median volume of the waveform. In other words, a highly inflected voice at a volume of 50 might peak well above that. Note that '0%' does not mean "mute". It represents the minimum audible volume level and 100% corresponds to the maximum comfortable level. The UA should allow the values corresponding to 0% and 100% to be set by the user. Suitable values depend on the equipment in use (speakers, headphones), the environment (in car, home theater, library) and personal preferences. Some examples:
The same author's style sheet could be used in all cases, simply by mapping the 0 and 100 points suitably at the client side.
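As a rough illustration, a style sheet might set the volume with either a keyword or a percentage. The sketch below assumes the property is named 'volume'; the selectors and the percentage are hypothetical.

```css
/* Assumes the property name 'volume'; selectors and numbers are illustrative. */
H1      { volume: loud }  /* keyword value */
P.aside { volume: 25% }   /* percentage, interpreted against the user's 0% and 100% settings */
```

Because the end points are chosen by the listener, rules like these would not need to change between, say, headphones in a library and speakers in a car.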
Value: <time> {1,2} (for 'pause' only)
Initial: UA specific
Applies to: all elements
Inherited: no
Percentage values: NA
These properties specify the pause before and after elements relative to the default pause. The 'pause' property is a shorthand for setting 'pause-before' and 'pause-after'.
Examples:
H1 { pause: 20ms } /* pause-before: 20ms; pause-after: 20ms */
H2 { pause: 30ms 40ms } /* pause-before: 30ms; pause-after: 40ms */
H3 { pause-after: 10ms } /* pause-before: ?; pause-after: 10ms */
Value: <url> | none
Initial: none
Applies to: all elements
Inherited: no
Percentage values: NA
Auditory icons are another way to distinguish semantic elements. Sounds may be played before and/or after the element to delimit it. The same sound can be used both before and after by using the 'cue' property.
Examples:
A { cue-before: url(bell.au); cue-after: url(dong.au) }
H1 { cue-before: url(pop.au); cue-after: url(pop.au) }
H1 { cue: url(pop.au) } /* same as previous */
Value: <url> | mix | none
Initial: mix
Applies to: all elements
Inherited: no
Percentage values: NA
Similar to the 'cue-before' and 'cue-after' properties, this indicates a sound to be played during an element (i.e., the sound is mixed in with the speech).
Does the sound play once only, or loop? With long icons, is the sound terminated once the element is spoken or is the icon left to play on?
What happens with mixed-mode rendering if an element is displayed onscreen rather than being spoken, yet has a cue-during property?
Examples:
BLOCKQUOTE.sad { cue-during: url(violins.aiff) }
Value: <words-per-minute> | x-slow | slow | medium | fast | x-fast | faster | slower
Initial: medium
Applies to: all elements
Inherited: yes
Percentage values: NA
Specifies the speaking rate. Note that both absolute and relative keyword values are allowed (compare with 'font-weight').
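A minimal sketch of how absolute and relative keywords might be combined. The property name 'speech-rate' is an assumption (it is not stated in the text above), and the selectors are hypothetical.

```css
/* Assumes the property name 'speech-rate'; selectors are illustrative. */
BODY    { speech-rate: medium }  /* absolute keyword */
P.legal { speech-rate: fast }    /* absolute keyword */
EM      { speech-rate: slower }  /* relative keyword, presumably relative to the inherited rate */
```

Since the property is inherited, a relative keyword on EM would be expected to track whatever rate the enclosing element uses, much as 'bolder' does for 'font-weight'.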
Value: [[<specific-voice> | <generic-voice>],]* [<specific-voice> | <generic-voice>]
Initial: UA
Applies to: all elements
Inherited: yes
Percentage values: NA
The value is a prioritized list of voice family names. Suggested generic families: male, female, child.
Examples of specific voice families are: comedian, paul, lisa.
Examples:
H1 { voice-family: announcer, male }
P.part.romeo { voice-family: romeo, male }
P.part.juliet { voice-family: juliet, female }
Value: <hertz> | x-low | low | medium | high | x-high
Initial: medium
Applies to: all elements
Inherited: yes
Percentage values: NA
Specifies the average pitch of the speaking voice in hertz (Hz).
Do we need more keyword values? How about 'medium-low' and 'medium-high'? Or 'soprano', 'mezzo-soprano', 'alto', 'tenor', 'baritone', and 'bass'?
Value: <percentage>
Initial: medium
Applies to: all elements
Inherited: yes
Percentage values: relative to..
Specifies variation in average pitch. A pitch range of 0 produces a flat, monotonic voice. A pitch range of 50 produces normal inflection. Pitch ranges greater than 50 produce animated voices.
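The three cases described above could be written as follows. This is only a sketch: the property name 'pitch-range' is the one used elsewhere in this document, while the class names and the 80% figure are hypothetical.

```css
/* Class names and the 80% value are illustrative. */
P.monotone { pitch-range: 0% }   /* flat voice */
P          { pitch-range: 50% }  /* normal inflection */
P.excited  { pitch-range: 80% }  /* animated voice */
```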
Value: <percentage>
Initial: medium
Applies to: all elements
Inherited: yes
Percentage values: Relative to...
Specifies the level of stress (assertiveness or emphasis) of the speaking voice. English is a stressed language, and different parts of a sentence are assigned primary, secondary or tertiary stress. The value of property 'stress' controls the amount of inflection that results from these stress markers. (Different speech devices may require the setting of one or more device-specific parameters to achieve this effect).
Increasing the value of this property results in the speech being more strongly inflected. It is in a sense the dual of the 'pitch-range' property and is provided to allow developers to exploit higher-end auditory displays.
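One way an author might use 'stress' alongside 'pitch-range' is sketched below. The percentages are hypothetical and chosen only to show the direction of the effect, not recommended settings.

```css
/* Percentages are illustrative; higher 'stress' means stronger inflection. */
EM     { stress: 70% }
STRONG { stress: 90%; pitch-range: 70% }
```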
Combine 'pitch-range' and 'stress' into one property 'inflection'?
Value: <percentage>
Initial: medium (50%)
Applies to: all elements
Inherited: yes
Percentage values: Relative to...
Specifies the richness (brightness) of the speaking voice. Different speech devices may require the setting of one or more device-specific parameters to achieve this effect.
The effect of increasing richness is to produce a voice that carries; reducing richness produces a soft, mellifluous voice.
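For example, assuming the property is named 'richness' (the name, selectors and percentages below are assumptions for illustration):

```css
/* Assumes the property name 'richness'; values are illustrative. */
H1      { richness: 80% }  /* a voice that carries */
P.aside { richness: 20% }  /* a softer, more mellifluous voice */
```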
The following four properties are very preliminary; discussion is invited:
Value: code | none
Initial: none
Applies to: all elements
Inherited: yes
Percentage values: NA
'code' is used to read all punctuation such as semicolons, braces, and so on. The default value of 'none' means that punctuation is not spoken but instead is rendered naturally as various pauses.
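A short sketch of how this might be applied to program listings. The property name 'speak-punctuation' is an assumption, as it is not given in the text above.

```css
/* Assumes the property name 'speak-punctuation'. */
PRE, CODE { speak-punctuation: code }  /* read semicolons, braces, and so on aloud */
P         { speak-punctuation: none }  /* render punctuation as pauses */
```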
Value: mdy | dmy | ymd | none
Initial: none
Applies to: all elements
Inherited: yes
Percentage values: NA
This is a hint that the element contains a date and also how that date should be spoken. month-day-year is common in the USA, while day-month-year is common in Europe and year-month-day is also used.
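As an illustration, table cells known to hold dates in a particular order could be marked up with classes and styled accordingly. The property name 'speak-date' and the class names are assumptions.

```css
/* Assumes the property name 'speak-date'; class names are hypothetical. */
TD.eu-date  { speak-date: dmy }  /* day-month-year */
TD.iso-date { speak-date: ymd }  /* year-month-day */
```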
Value: digits | continuous
Initial: none
Applies to: all elements
Inherited: yes
Percentage values: NA
Value: 24 | 12 | none
Initial: none
Applies to: all elements
Inherited: yes
Percentage values: NA