W3CNOTE-ACSS-970107

Aural Cascading Style Sheets (ACSS)

W3C NOTE 07-January-1997

This version:
http://www.w3.org/pub/WWW/Style/CSS/Speech/NOTE-ACSS-970107
Previous version:
http://www.w3.org/pub/WWW/Style/CSS/Speech/NOTE-ACSS-961210
Latest version:
http://www.w3.org/pub/WWW/Style/CSS/Speech/NOTE-ACSS
Editor:
Chris Lilley, W3C

Status of this document

This document is a W3C NOTE for review by W3C members and other interested parties. It is a draft document and may be updated, replaced or obsoleted by other documents at any time. It is inappropriate to use W3C NOTES as reference material or to cite them as other than "work in progress". A list of current W3C working drafts and notes can be found at: http://www.w3.org/pub/WWW/TR/

This document draws very heavily on the initial CSS properties proposed by T.V. Raman. It also attempts to address those issues raised on the www-style mailing list that were dealt with in the later modification of that proposal, though not necessarily in the same way.

This document builds upon the CSS1 specification.

Note: since working drafts and notes are subject to frequent change, you are advised to reference the above URL, rather than the URLs for working drafts themselves.


1    Aural presentation

Those of us who are sighted are accustomed to visual presentation of HTML documents, frequently on a bitmapped display. This is not the only possible presentation method, however. Aural presentation, using a combination of speech synthesis and 'audio icons', provides an alternative presentation. This form of presentation is in current use by the blind and print-impaired communities.

Often such aural presentation occurs by converting the document to plain text and feeding this to a 'screen reader' -- software or hardware that simply reads all the characters on the screen. This results in less effective presentation than would be the case if the document structure were retained.

There are other large markets for aural presentation, including in-car and home entertainment use; aural or mixed aural/visual presentation is thus likely to increase in importance over the next few years. It is worth noting that the aural rendering is essentially independent of the visual rendering.

Given this current and future importance, it makes sense to influence presentation based on the structure of the document. Cascading Style Sheets [1] may be used for this purpose. Using style sheets rather than HTML tag extensions allows the same document to be read with visual, aural, or multimodal presentation without cluttering up the document, and without having to produce three (or more) separate parallel documents, an approach which has been shown to result in update problems. Style sheets thus provide greatly improved document accessibility for visually disabled people without requiring compromises in the visual design of the document.

2    Aural presentation with CSS

This section extends the CSS1 specification to allow additional types of value.

Style sheets influence the presentation of documents by assigning values to style properties. This section lists the defined style properties for aural presentation, and their corresponding list of possible values, expressed in CSS1 syntax.

The list of CSS1 properties has been kept to a minimum, while making sure commonly used styles can be expressed. Depending on the formatting model and the presentation medium, some properties can prove hard to incorporate into existing UA implementations. For example, a monaural browser cannot fully honor spatial audio, but should approximate it by mixing all the channels together.

2.1    Notation for property values

In the text below, the allowed values for each property are listed with a syntax like the following:

Value: N | NW | NE
Value: [ <length> | thick | thin ]{1,4}
Value: <url>? <color> [ / <color> ]?
Value: <url> || <color>

The words between < and > give a type of value. The most common types are <length>, <percentage>, <url>, <number> and <color>; these are described in the section on units. The more specialized types (e.g. <font-family> and <border-style>) are described under the property where they appear.

Other words are keywords that must appear literally, without quotes. The slash (/) and the comma (,) must also appear literally.

Several things juxtaposed mean that all of them must occur, in the given order. A bar (|) separates alternatives: one of them must occur. A double bar (A || B) means that either A or B or both must occur, in any order. Brackets ([]) are for grouping. Juxtaposition is stronger than the double bar, and the double bar is stronger than the bar. Thus "a b | c || d e" is equivalent to "[ a b ] | [ c || [ d e ]]".

Every type, keyword, or bracketed group may be followed by one of the following modifiers:

  *      the preceding item occurs zero or more times
  +      the preceding item occurs one or more times
  ?      the preceding item is optional
  {A,B}  the preceding item occurs at least A and at most B times

2.2    New Units/Values

This specification introduces two new types of unit in addition to those of CSS1.

2.2.1    Angle units

These are the legal angle units:

  deg: degrees
  grad: gradians
  rad: radians

2.2.2    Time units

These are the legal time units:

  ms: milliseconds
  s: seconds

3    General audio properties

3.1    'volume'

Value: <percentage> | mute | x-soft | soft | medium | loud | x-loud
Initial: medium
Applies to: all elements
Inherited: yes
Percentage values: relative to user-specified mapping

The legal range of percentage values is 0% to 100%. There is a fixed mapping between keyword values and percentages:

  x-soft    0%
  soft      25%
  medium    50%
  loud      75%
  x-loud    100%

('mute' is not part of this scale; it means that no sound at all is produced.)

Volume refers to the median volume of the waveform. In other words, a highly inflected voice at a volume of 50% might peak well above that. Note that '0%' does not mean the same as "mute". 0% represents the minimum audible volume level and 100% corresponds to the maximum comfortable level. The UA should allow the values corresponding to 0% and 100% to be set by the user, since suitable values depend on the equipment in use (speakers, headphones), the environment (in car, home theater, library) and personal preferences. For example, the comfortable listening range in a quiet library is both narrower and quieter than in a moving car; the same author's style sheet can be used in both cases, simply by mapping the 0% and 100% points suitably on the client side.
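
A style sheet might then request volume levels as follows (the selectors here are purely illustrative):

  P      { volume: soft }
  EM     { volume: 75% }
  .alert { volume: x-loud }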

3.2    'pause-before'

Value: <time> | <percentage>
Initial: UA specific
Applies to: all elements
Inherited: no
Percentage values: NA

This property specifies the pause before elements. It may be given in absolute time units (seconds, milliseconds) or as a relative value, in which case it is relative to the reciprocal of the 'speed' property: if the speed is 120 words per minute (i.e. a word takes half a second, or 500 milliseconds) then a pause-before of 100% means a pause of 500ms and a pause-before of 20% means 100ms.

Using relative units gives more robust stylesheets in the face of large changes in speed.
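
These values might be written as follows (illustrative selectors; the relative figure assumes the 120 words-per-minute rate used above):

  H1 { pause-before: 500ms }  /* absolute: half a second */
  LI { pause-before: 20% }    /* relative: 100ms at 120 words per minute */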

3.3    'pause-after'

Value: <time> | <percentage>
Initial: UA specific
Applies to: all elements
Inherited: no
Percentage values: NA

This property specifies the pause after elements. Values are specified the same way as 'pause-before'.

3.4    'pause'

Value: [ <time> | <percentage> ]{1,2}
Initial: UA specific
Applies to: all elements
Inherited: no
Percentage values: NA

The 'pause' property is a shorthand for setting 'pause-before' and 'pause-after'. The first value is pause-before and the second is pause-after. If only one value is given, it applies to both properties.

Examples:

  H1 { pause: 20ms } /* pause-before: 20ms; pause-after: 20ms */
  H2 { pause: 30ms 40ms } /* pause-before: 30ms; pause-after: 40ms */
  H3 { pause-after: 10ms } /* pause-before: ?; pause-after: 10ms */

3.5    'cue', 'cue-before', 'cue-after'

Value: <url> | none
Initial: none
Applies to: all elements
Inherited: no
Percentage values: NA

Auditory icons are another way to distinguish semantic elements. Sounds may be played before and/or after the element to delimit it. The same sound can be used both before and after by using the 'cue' shorthand property.

Examples:

  A {cue-before: url(bell.aiff); cue-after: url(dong.wav) }
  H1 {cue-before: url(pop.au); cue-after: url(pop.au) }
  H1 {cue: url(pop.au) }  /* same as previous */

3.6   'play-during'

Value: <url> | mix | none
Initial: mix
Applies to: all elements
Inherited: no
Percentage values: NA

Similar to the 'cue-before' and 'cue-after' properties, this indicates a sound to be played during an element as a background (i.e. the sound is mixed in with the speech).

Does the sound play once only, or loop? With long icons, is the sound terminated once the element has been spoken, or is the icon left to play on?

What happens with mixed-mode rendering if an element is displayed onscreen rather than being spoken, yet has a 'play-during' property?

Examples:

  BLOCKQUOTE.sad {play-during: url(violins.aiff)}

4    Spatial properties

Spatial audio is an important stylistic property for aural presentation. It provides a natural way to tell several voices apart, the same way we do in real life (people rarely all stand in the same spot in a room). Stereo speakers produce a lateral soundstage. Binaural headphones or the increasingly popular 5-speaker home theater setups can generate full surround sound, and multi-speaker setups can create a true three-dimensional soundstage. VRML 2.0 also includes spatial audio, which implies that spatial audio hardware will become more widely available.

4.1    'azimuth'

Value: <angle> | [[ left-side | far-left | left | center-left | center | center-right | right | far-right | right-side ] || behind ]
Initial: 0deg
Applies to: all elements
Inherited: yes

The value is given in the range 0deg <= x < 360deg where 0deg is interpreted as directly ahead in the center of the sound stage. 90deg is to the right, 180deg behind and 270deg to the left. It may also be specified using keywords:

keyword        value     value with 'behind'

left-side      270deg    270deg
far-left       300deg    240deg
left           320deg    220deg
center-left    340deg    200deg
center         0deg      180deg
center-right   20deg     160deg
right          40deg     140deg
far-right      60deg     120deg
right-side     90deg     90deg

This property is most likely to be implemented by mixing the same signal into different channels at differing volumes. It might also use phase shifting, digital delay, and other such techniques to provide the illusion of a soundstage. The precise means used to achieve this effect and the number of speakers used to do so are browser dependent - this property merely identifies the desired end result. Examples:

  H1   { azimuth: 30deg }          
  TD.a { azimuth: far-right }          /*  60deg */
  #12  { azimuth: behind far-right }   /* 120deg */
  P.comment { azimuth: behind }        /* 180deg */

Do we need relative values like "more-to-the-right" ?

UAs should attempt to honor this request if they have the resources to do so. If 'azimuth' is specified and the output device cannot produce sounds behind the listening position, values in the rearwards hemisphere should be converted into forwards hemisphere values. One method is to mirror such values about the left-right (90deg-270deg) axis: an angle x with 90deg < x <= 180deg becomes 180deg - x, and an angle x with 180deg < x < 270deg becomes 540deg - x. For example, 'behind far-right' (120deg) would then be rendered at 60deg, i.e. as far-right.

4.2    'elevation'

Value: <angle> | below | level | above | higher | lower
Initial: 0deg
Applies to: all elements
Inherited: yes

The value is given in degrees in the range -90deg to 90deg. 0deg is interpreted as on the forward horizon, which loosely means level with the listener. 90deg is directly overhead and -90deg is directly underneath. The precise means used to achieve this effect and the number of speakers used to do so are undefined; this property merely identifies the desired end result. UAs should attempt to honor this request if they have the resources to do so. Examples:

        
  H1   { elevation: above }   
  TR.a { elevation: 60deg }
  TR.b { elevation: 30deg }
  TR.c { elevation: level } 

5    Speech properties

5.1    'speed' (or 'speech-rate' ?)

Value: <words-per-minute> | x-slow | slow | medium | fast | x-fast | faster | slower
Initial: medium
Applies to: all elements
Inherited: yes
Percentage values: NA

Specifies the speaking rate. Note that both absolute and relative keyword values are allowed (compare with 'font-weight').
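
For instance (illustrative selectors only):

  P      { speed: medium }
  PRE    { speed: slow }    /* absolute keyword value */
  .aside { speed: faster }  /* relative to the inherited speed */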

5.2    'voice-family'

Value: [[<specific-voice> | <generic-voice>],]* [<specific-voice> | <generic-voice>]
Initial: UA
Applies to: all elements
Inherited: yes
Percentage values: NA

The value is a prioritized list of voice family names. Suggested generic voice families are: male, female, child.

Examples of specific voice families are: comedian, paul, lisa

Examples:

  H1 { voice-family: announcer, male }
  P.part.romeo {  voice-family: romeo, male }
  P.part.juliet { voice-family: juliet, female }

5.3    'pitch' (or 'average pitch'? What is the relationship with 'voice-family'?)

Value: <hertz> | x-low | low | medium | high | x-high
Initial: medium
Applies to: all elements
Inherited: yes
Percentage values: NA

Specifies the average pitch of the speaking voice in hertz (Hz).
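
For instance, a low narrator's voice and a high child's voice might be requested as follows (the class names are purely illustrative):

  P.narrator { voice-family: male; pitch: low }
  P.child    { voice-family: child; pitch: high }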

Do we need more keyword values? How about 'medium-low' and 'medium-high'? Or 'soprano', 'mezzo-soprano', 'alto', 'tenor', 'baritone', and 'bass'?

5.4    'pitch-range' (could be combined with 'pitch' ?) or 'inflection'

Value: <percentage>
Initial: 50%
Applies to: all elements
Inherited: yes
Percentage values: relative to..

Specifies variation in average pitch. A pitch range of 0% produces a flat, monotonic voice. A pitch range of 50% produces normal inflection. Pitch ranges greater than 50% produce animated voices.
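
For instance (illustrative selectors only):

  TD.figures { pitch-range: 0% }   /* flat, monotonic reading */
  EM         { pitch-range: 80% }  /* strongly animated voice */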

5.5    'stress'

Value: <percentage>
Initial: medium
Applies to: all elements
Inherited: yes
Percentage values: Relative to...

Specifies the level of stress (assertiveness or emphasis) of the speaking voice. English is a stressed language, and different parts of a sentence are assigned primary, secondary or tertiary stress. The value of property 'stress' controls the amount of inflection that results from these stress markers.

Increasing the value of this property results in the speech being more strongly inflected. It is in a sense dual to the 'pitch-range' property, and is provided to allow developers to exploit higher-end auditory displays.
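
For instance (illustrative selectors only):

  STRONG     { stress: 80% }  /* strongly inflected */
  P.legalese { stress: 20% }  /* flatter, less assertive delivery */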

Combine 'pitch-range' and 'stress' into one property 'inflection'?

5.6    'richness' ('brightness' ?)

Value: <percentage>
Initial: medium (50%)
Applies to: all elements
Inherited: yes
Percentage values: Relative to...

Specifies the richness (brightness) of the speaking voice. Different speech devices may require the setting of one or more device-specific parameters to achieve this effect.

The effect of increasing richness is to produce a voice that carries -- reducing richness produces a soft, mellifluous voice.
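
For instance (illustrative class names only):

  .announcement { richness: 90% }  /* a voice that carries */
  .aside        { richness: 25% }  /* soft, mellifluous voice */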

The following four properties are very preliminary; discussion is invited:

5.7    'speak-punctuation'

Value: code | none
Initial: none
Applies to: all elements
Inherited: yes
Percentage values: NA

'code' indicates that punctuation such as semicolons, braces, and so on is to be spoken literally. The default value of 'none' means that punctuation is not spoken but is instead rendered naturally as various pauses.
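
For instance, a style sheet for program listings might request that punctuation be spoken literally (the class name is purely illustrative):

  PRE.code { speak-punctuation: code }  /* braces, semicolons etc. are spoken aloud */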

5.8    'speak-date'

Value: mdy | dmy | ymd | none
Initial: none
Applies to: all elements
Inherited: no
Percentage values: NA

This is a hint that the element contains a date, and also how that date should be spoken. Month-day-year is common in the USA, day-month-year is common in Europe, and year-month-day is also used.
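
For instance (illustrative class names only):

  .us-date   { speak-date: mdy }  /* month-day-year */
  .euro-date { speak-date: dmy }  /* day-month-year */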

This should really be an HTML tag rather than a style sheet property, since it gives semantic information about the content.

5.9    'speak-numeral'

Value: digits | continuous
Initial: none
Applies to: all elements
Inherited: yes
Percentage values: NA
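
Presumably 'digits' requests that a numeral be spoken as a string of individual digits, while 'continuous' requests that it be spoken as a single number. On that assumption (and with purely illustrative class names):

  .phone { speak-numeral: digits }      /* "237" spoken as "two three seven" */
  .price { speak-numeral: continuous }  /* "237" spoken as "two hundred and thirty-seven" */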

5.10    'speak-time'

Value: 24 | 12 | none
Initial: none
Applies to: all elements
Inherited: yes
Percentage values: NA
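
Presumably the values select a 24-hour or 12-hour clock style for speaking times. On that assumption (and with purely illustrative class names):

  .timetable { speak-time: 24 }  /* "1430" spoken as "fourteen thirty" */
  .memo      { speak-time: 12 }  /* spoken as "two thirty PM" */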

6    References

[1] Cascading Style Sheets, level 1; Håkon Wium Lie and Bert Bos; W3C Recommendation, 17 December 1996.


Chris Lilley (editor)
Created: 28-Feb-1996
Last modified: $Date: 1997/01/08 00:31:42 $