WD-acss-970630

Aural Cascading Style Sheets (ACSS)

W3C Working Draft 30-June-1997

This version: http://www.w3.org/Style/Group/WD-acss-970630
Previous public version: http://www.w3.org/TR/WD-acss-970606
Previous (member only) version: http://www.w3.org/Style/Group/WD-acss

Authors: Chris Lilley, W3C; T. V. Raman, Adobe

Status of this document

This document is an intermediate draft produced by the W3C CSS&FP Working Group as part of the Stylesheets Activity; it is stable enough to be released for public comment (to www-style@w3.org) but may change before approval as (part of) a recommendation. Hence it should not be implemented as part of a production system, but may be implemented on an experimental basis.

This is a W3C Working Draft for review by W3C members and other interested parties. It is a draft document and may be updated, replaced or obsoleted by other documents at any time. When referenced, W3C Working Drafts should be cited as "work in progress". A list of current W3C technical reports can be found at http://www.w3.org/pub/WWW/TR.

This document draws very heavily on the initial CSS properties proposed by T.V. Raman. It also attempts to address those issues, raised on the www-style mailing list, that were altered in the later modification but does not address them in the same way.

This document builds upon the CSS1 specification.

1 Aural presentation

Those of us who are sighted are accustomed to visual presentation of HTML documents, frequently on a bitmapped display. This is not the only possible presentation method, however. Aural presentation, using a combination of speech synthesis and 'audio icons', provides an alternative presentation. This form of presentation is already in current use by the blind and print-impaired communities.

Often such aural presentation occurs by converting the document to plain text and feeding this to a 'screen reader' -- software or hardware that simply reads all the characters on the screen. This results in less effective presentation than would be the case if the document structure were retained. A benefit of separating the content (the HTML) and the visual presentation (the stylesheet) is that other types of presentation can also be offered as options (other stylesheets). Stylesheet properties for aural presentation can be used together with visual properties (mixed media) or as an aural alternative to visual presentation.

Besides the obvious accessibility issues for the blind, there are other large markets for aural presentation:

in-car use: keep your eyes on the road ahead, Jack, and search the web for recommended hotels in the next town up ahead
industrial and medical documentation systems (intranets): my hands and eyes are otherwise occupied with your triple bypass but I would still like your medication records
home entertainment: images, headlines, movies are fine on the widescreen TV but I don't want to read body text off the screen from the couch; speak it to me (perhaps through the 5 speaker home theater set-up)
the illiterate: I understand everything you say, but I don't read too well

Hence, aural or mixed aural/visual presentation is likely to increase in importance over the next few years. Realizing that the aural rendering is essentially independent of the visual rendering:

Allows orthogonal aural and visual views.
Allows browsers to optionally implement both aural and visual views to produce truly multimodal documents.

Given this current and future importance, it makes sense to influence presentation based on the structure of the document. Cascading Style Sheets [1] may be used for this purpose. Using style sheets rather than HTML tag extensions allows the same document to be read with visual, aural, or multimodal presentation without cluttering up the document or having to produce three (or more) separate parallel documents - which has been shown to result in consistency and update problems. This approach provides greatly improved document accessibility for visually disabled people (the information is better presented and is just as up-to-date as the visual version) without requiring compromises in the visual design of the document.

2 Aural presentation with CSS

This section extends the CSS1 specification to allow additional types of value.

Style sheets influence the presentation of documents by assigning values to style properties. This section lists the defined style properties for aural presentation, and their corresponding list of possible values, expressed in CSS1 syntax.

The list of CSS1 properties has been kept to a minimum, while making sure commonly used styles can be expressed. Depending on the formatting model and the presentation medium, some properties can prove hard to incorporate into existing UA implementations. E.g., a monaural browser is not able to fully honor spatial audio, but should approximate by mixing all the channels together.

2.1 Notation for property values

In the text below, the allowed values for each property are listed with a syntax like the following (this is the same syntax as CSS1):

Value: N | NW | NE
Value: [ <length> | thick | thin ]{1,4}
Value: <uri>? <color> [ / <color> ]?
Value: <uri> || <color>

The words between < and > give a type of value. This specification introduces some new units for property values.

Other words are keywords that must appear literally, without quotes. The slash (/) and the comma (,) must also appear literally.

Several things juxtaposed mean that all of them must occur, in the given order. A bar (|) separates alternatives: one of them must occur. A double bar (A || B) means that either A or B or both must occur, in any order. Brackets ([]) are for grouping. Juxtaposition is stronger than the double bar, and the double bar is stronger than the bar. Thus "a b | c || d e" is equivalent to "[ a b ] | [ c || [ d e ]]".

Every type, keyword, or bracketed group may be followed by one of the following modifiers:

An asterisk (*) indicates that the preceding type, word or group is repeated zero or more times.
A plus (+) indicates that the preceding type, word or group is repeated one or more times.
A question mark (?) indicates that the preceding type, word or group is optional.
A pair of numbers in curly braces ({A,B}) indicates that the preceding type, word or group is repeated at least A and at most B times.
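
As an illustration of how these modifiers read, the following Python sketch checks a value against the second notation example above, [ <length> | thick | thin ]{1,4}. It is not part of this specification: the function name matches_example and the simplified <length> pattern (a number followed by a CSS1 length unit) are assumptions made for the example.

```python
import re

# Simplified <length>: an optionally signed number followed by a CSS1 unit.
# (Assumption: the unitless zero length is left out for brevity.)
LENGTH = re.compile(r"^[+-]?\d+(\.\d+)?(em|ex|px|pt|pc|in|cm|mm)$")

def matches_example(value):
    """Check a value against  [ <length> | thick | thin ]{1,4}."""
    tokens = value.split()
    # {1,4}: the bracketed group is repeated at least once, at most four times.
    if not 1 <= len(tokens) <= 4:
        return False
    # The bar separates alternatives: each token must be exactly one of them.
    return all(t in ("thick", "thin") or LENGTH.match(t) is not None
               for t in tokens)

print(matches_example("thin"))                 # True
print(matches_example("2px thick thin 1em"))   # True
print(matches_example("thin thin thin thin thin"))  # False: five repetitions
```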

2.2 New Units/Values

This specification introduces several new units in addition to the units of CSS1.

2.2.1 Angle units

These are the legal angle units:

deg: degrees
grad: gradians
rad: radians

Values in these units may be negative. They should be normalised to the range 0-360deg by the UA. For example, -10deg and 350deg are equivalent.
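
The normalisation can be sketched as follows in Python. This is only an illustration: the name normalise_angle is hypothetical, and treating the range as half-open (so that 360deg maps to 0deg) is an assumption, since the text does not say which end is included.

```python
import math

# Number of units in a full circle for each legal angle unit.
UNITS_PER_CIRCLE = {
    "deg": 360.0,
    "grad": 400.0,
    "rad": 2.0 * math.pi,
}

def normalise_angle(value, unit="deg"):
    """Convert an angle to degrees and reduce it to 0 <= x < 360."""
    degrees = value * 360.0 / UNITS_PER_CIRCLE[unit]
    # Python's % always returns a non-negative result for a positive
    # modulus, so negative angles wrap around as required.
    return degrees % 360.0

print(normalise_angle(-10, "deg"))   # 350.0
print(normalise_angle(100, "grad"))  # 90.0
```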

2.2.2 Time units

These are the legal time units:

ms: milliseconds
s: seconds

Time values may not be negative.
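
A UA might reduce both time units to milliseconds internally. The Python sketch below shows one way to do so; the function name to_milliseconds and the choice to raise ValueError for negative or unrecognised values are assumptions, not requirements of this draft.

```python
def to_milliseconds(text):
    """Convert a time value such as '20ms' or '1.5s' to milliseconds."""
    # Test 'ms' first, because every 'ms' value also ends in 's'.
    if text.endswith("ms"):
        number, factor = text[:-2], 1.0
    elif text.endswith("s"):
        number, factor = text[:-1], 1000.0
    else:
        raise ValueError("not a time value: %r" % text)
    value = float(number) * factor
    if value < 0:
        raise ValueError("time values may not be negative: %r" % text)
    return value

print(to_milliseconds("20ms"))  # 20.0
print(to_milliseconds("1.5s"))  # 1500.0
```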

2.2.3 Frequency units

These are the legal frequency units:

Hz: Hertz
kHz: kiloHertz

Example: 200Hz is a bass sound, and 6kHz is a treble sound.

3 General audio properties

3.1 'volume'

The legal range of percentage values is 0% to 100%. Note that '0%' does not mean the same as "silent". 0% represents the minimum audible volume level and 100% corresponds to the maximum comfortable level. There is a fixed mapping between keyword values and percentages:

'silent' = no sound at all, the element is spoken silently
'x-soft' = '0%'
'soft' = '25%'
'medium' = '50%'
'loud' = '75%'
'x-loud' = '100%'

Volume refers to the median volume of the waveform. In other words, a highly inflected voice at a volume of 50 might peak well above that. The overall values are likely to be human adjustable for comfort, for example with a physical volume control (which would increase both the 0% and 100% values proportionately); what this property does is adjust the dynamic range.

The UA should allow the values corresponding to 0% and 100% to be set by the listener. No one setting is universally applicable; suitable values depend on the equipment in use (speakers, headphones), the environment (in car, home theater, library) and personal preferences. Some examples:

A browser for in-car use has a setting for when there is lots of background noise. 0% would map to a fairly high level and 100% to a quite high level. The speech is easily audible over the road noise but the overall dynamic range is compressed. Plusher cars with better insulation allow a wider dynamic range.
Another speech browser is being used in the home, late at night, (don't annoy the neighbors) or in a shared study room. 0% is set to a very quiet level and 100% to a fairly quiet level, too. As with the first example, there is a low slope; the dynamic range is reduced. The actual volumes are low here, whereas they were high in the first example.
In a quiet and isolated house, an expensive hi-fi home theatre setup. 0% is set fairly low and 100% to quite high; there is wide dynamic range.

The same author's stylesheet could be used in all cases, simply by mapping the 0% and 100% points suitably at the client side.
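
The client-side mapping might look like the following Python sketch. The keyword percentages are those listed above. The linear interpolation between the listener's two settings, the function name output_level, and the use of an arbitrary device level scale are assumptions; this draft does not prescribe how a UA performs the mapping.

```python
# Fixed mapping between keyword values and percentages.
VOLUME_KEYWORDS = {
    "x-soft": 0.0,
    "soft": 25.0,
    "medium": 50.0,
    "loud": 75.0,
    "x-loud": 100.0,
}

def output_level(volume, level_at_0, level_at_100):
    """Map a 'volume' value to a device level.

    volume       -- a keyword, or a percentage given as a number (0 to 100)
    level_at_0   -- listener's setting for 0% (hypothetical device units)
    level_at_100 -- listener's setting for 100%
    Returns None for 'silent', which produces no sound at all.
    """
    if volume == "silent":
        return None
    percent = VOLUME_KEYWORDS[volume] if isinstance(volume, str) else float(volume)
    if not 0.0 <= percent <= 100.0:
        raise ValueError("percentage out of range: %r" % volume)
    # Assumed linear interpolation between the two listener settings.
    return level_at_0 + (level_at_100 - level_at_0) * percent / 100.0

# In-car settings: narrow range at a high level.
print(output_level("medium", 60, 80))  # 70.0
# Late-night settings: narrow range at a low level.
print(output_level("medium", 20, 30))  # 25.0
```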

If an element has a volume of silent, it is spoken silently. It takes up the same time as if it had been spoken, including any pause before and after the element, but no sound is generated. This may be used in language teaching applications, for example. A pause is generated for the pupil to speak the element themselves. Note: the value is inherited so child elements will also be silent. Child elements may however set the volume to a non-silent value and will then be spoken.

To inhibit the speaking of an element and all its children so that it takes no time at all (for example, to get the effect of collapsing and expanding lists), use the CSS1 property 'display':

display: none

When using the rule display: none the element takes up no time; it is not represented as a pause the length of the spoken text.

3.2 'pause-before'

Value: <time> | <percentage>
Initial: UA specific
Applies to: all elements
Inherited: no
Percentage values: see text

This property specifies the pause before an element is spoken. It may be given in absolute units (seconds, milliseconds) or as a relative value, in which case it is relative to the reciprocal of the 'speech-rate' property: if speech-rate is 120 words per minute (i.e. a word takes half a second, 500 milliseconds) then a pause-before of 100% means a pause of 500ms and a pause-before of 20% means 100ms.

Using relative units gives more robust stylesheets in the face of large changes in speech-rate and is recommended practice.
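
The percentage calculation described for 'pause-before' can be written out as a short Python sketch. The function name pause_ms is hypothetical; the arithmetic follows the worked figures in the text.

```python
def pause_ms(percentage, speech_rate_wpm):
    """Resolve a percentage pause to milliseconds.

    The percentage is relative to the reciprocal of the speech rate,
    that is, to the time taken by one word.
    """
    ms_per_word = 60000.0 / speech_rate_wpm
    return ms_per_word * percentage / 100.0

print(pause_ms(100, 120))  # 500.0
print(pause_ms(20, 120))   # 100.0
```

Doubling the speech rate to 240 words per minute halves both pauses, which is why relative values survive large changes in speech-rate.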

3.3 'pause-after'

Value: <time> | <percentage>
Initial: UA specific
Applies to: all elements
Inherited: no
Percentage values: see text

This property specifies the pause after an element is spoken. Values are specified the same way as 'pause-before'.

3.4 'pause'

Value: [ <time> | <percentage> ]{1,2}
Initial: UA specific
Applies to: all elements
Inherited: no
Percentage values: see text for pause-before, pause-after

The 'pause' property is a shorthand for setting 'pause-before' and 'pause-after'. If two values are given, the first value is pause-before and the second is pause-after. If only one value is given, it applies to both properties.

Examples:

  H1 { pause: 20ms } /* pause-before: 20ms; pause-after: 20ms */
  H2 { pause: 30ms 40ms } /* pause-before: 30ms; pause-after: 40ms */
  H3 { pause-after: 10ms } /* pause-before: ?; pause-after: 10ms */
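
The expansion of the shorthand can be sketched in Python as below. The function name expand_pause is hypothetical and the values are kept as unparsed strings for simplicity.

```python
def expand_pause(value):
    """Expand a 'pause' shorthand value into its two longhand properties."""
    parts = value.split()
    if len(parts) == 1:
        # One value applies to both properties.
        before = after = parts[0]
    elif len(parts) == 2:
        # Two values: pause-before first, then pause-after.
        before, after = parts
    else:
        raise ValueError("'pause' takes one or two values: %r" % value)
    return {"pause-before": before, "pause-after": after}

print(expand_pause("20ms"))       # {'pause-before': '20ms', 'pause-after': '20ms'}
print(expand_pause("30ms 40ms"))  # {'pause-before': '30ms', 'pause-after': '40ms'}
```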

3.5 'cue', 'cue-before', 'cue-after'

Value: <url> | none
Initial: none
Applies to: all elements
Inherited: no
Percentage values: N/A

Auditory icons are another way to distinguish semantic elements. Sounds may be played before, and/or after the element to delimit it. The same sound can be used both before and after, using the shorthand 'cue' property.

Examples:

  A {cue-before: url(bell.aiff); cue-after: url(dong.wav) }
  H1 {cue-before: url(pop.au); cue-after: url(pop.au) }
  H1 {cue: url(pop.au) }  /* same as previous */

The :before and :after pseudo-elements (see frostings document) could be used to generate this content, rather than using two special-purpose properties. This would be more general.

3.6 'play-during'

Value: <url> mix? repeat? | auto | none
Initial: auto
Applies to: all elements
Inherited: no
Percentage values: N/A

Similar to the cue-before and cue-after properties, this indicates sound to be played during an element as a background (i.e. the sound is mixed in with the speech).

The optional 'mix' keyword means the sound inherited from the parent element's play-during property continues to play and the current element sound (pointed to by the URL) is mixed with it. If mix is not specified, the sound replaces the sound of the parent element.

The optional 'repeat' keyword means the sound will repeat if it is too short to fill the entire duration of the element. Without this keyword, the sound plays once and then stops. This is similar to the background repeat properties in CSS1. If the sound is too long for the element, it is clipped once the element is spoken.

Auto means that the sound of the parent element continues to play (it is not restarted, which would have been the case if this property inherited) and none means that there is silence - the sound of the parent element (if any) is silent for the current element and continues after the current element.

Examples:

  BLOCKQUOTE.sad {play-during: url(violins.aiff) }
  BLOCKQUOTE Q {play-during: url(harp.wav) mix}
  SPAN.quiet {play-during: none }
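
The interaction of 'auto', 'none' and the 'mix' keyword can be modelled with the Python sketch below, which returns the list of background sounds audible during an element. The function name sounds_during and the tuple representation of a URL value are assumptions made for the example; 'repeat' is left out because it affects only looping, not which sounds are heard.

```python
def sounds_during(parent_sounds, value):
    """Return the background sounds heard while an element is spoken.

    parent_sounds -- list of sounds already playing for the parent element
    value         -- 'auto', 'none', or a (url, mix) tuple
    """
    if value == "auto":
        # The parent's sound simply continues to play.
        return list(parent_sounds)
    if value == "none":
        # Silence for this element; the parent's sound resumes afterwards.
        return []
    url, mix = value
    if mix:
        # The parent's sound keeps playing and the new sound is mixed in.
        return list(parent_sounds) + [url]
    # Without 'mix' the new sound replaces the parent's sound.
    return [url]

blockquote = sounds_during([], ("violins.aiff", False))
quote = sounds_during(blockquote, ("harp.wav", True))
quiet = sounds_during(blockquote, "none")
print(blockquote)  # ['violins.aiff']
print(quote)       # ['violins.aiff', 'harp.wav']
print(quiet)       # []
```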

Note: If a stereo icon is dereferenced the central point of the stereo pair should be placed at the azimuth for that element and the left and right channels should be placed to either side of this position.

4 Spatial properties

Spatial audio is an important stylistic property for aural presentation. It provides a natural way to tell several voices apart, the same way we use in real life (people rarely all stand in the same spot in a room). Stereo speakers produce a lateral soundstage. Binaural headphones or the increasingly popular 5-speaker home theatre setups can generate full surround sound, and multi-speaker setups can create a true three-dimensional soundstage. VRML 2.0 also includes spatial audio (and uses the same azimuth and elevation terms, which originate in astronomy), which implies that in time consumer-priced spatial audio hardware will become more widely available.

4.1 'azimuth'

Value: <angle>| [[left-side | far-left | left | center-left | center | center-right | right | far-right | right-side] || behind ] | leftwards | rightwards
Initial: center
Applies to: all elements
Inherited: yes

The value is given in the range -360deg <= x < 360deg where 0deg is interpreted as directly ahead in the center of the sound stage. 90deg is to the right, 180deg behind and 270deg (or, equivalently and more conveniently, -90deg) to the left. It may also be specified using absolute keywords:

keyword	value	value with 'behind'
left-side	270deg	270deg
far-left	300deg	240deg
left	320deg	220deg
center-left	340deg	200deg
center	0deg	180deg
center-right	20deg	160deg
right	40deg	140deg
far-right	60deg	120deg
right-side	90deg	90deg

or relative keywords. The value leftwards moves the sound more to the left (subtracts 20 degrees) while the value rightwards moves the sound more to the right (adds 20 degrees). Arithmetic is carried out modulo 360 degrees.
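
The keyword table and the relative keywords can be expressed as the following Python sketch. The names AZIMUTH_KEYWORDS, azimuth_degrees and shift_azimuth are hypothetical; the angles are copied from the table above.

```python
# keyword: (value, value with 'behind'), both in degrees
AZIMUTH_KEYWORDS = {
    "left-side":    (270, 270),
    "far-left":     (300, 240),
    "left":         (320, 220),
    "center-left":  (340, 200),
    "center":       (0,   180),
    "center-right": (20,  160),
    "right":        (40,  140),
    "far-right":    (60,  120),
    "right-side":   (90,  90),
}

def azimuth_degrees(keyword="center", behind=False):
    """Look up an absolute keyword; 'behind' on its own means center behind."""
    front, rear = AZIMUTH_KEYWORDS[keyword]
    return rear if behind else front

def shift_azimuth(current, direction):
    """Apply 'leftwards' or 'rightwards' to an azimuth, modulo 360 degrees."""
    step = {"leftwards": -20, "rightwards": 20}[direction]
    return (current + step) % 360

print(azimuth_degrees("far-right", behind=True))  # 120
print(shift_azimuth(10, "leftwards"))             # 350
```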

This property is most likely to be implemented by mixing the same signal into different channels at differing volumes. It might also use phase shifting, digital delay, and other such techniques to provide the illusion of a soundstage. The precise means used to achieve this effect and the number of speakers used to do so are browser dependent - this property merely identifies the desired end result. Examples:

  H1   { azimuth: 30deg }          
  TD.a { azimuth: far-right }          /*  60deg */
  #12  { azimuth: behind far-right }   /* 120deg */
  P.comment { azimuth: behind }        /* 180deg */

UAs should attempt to honor this request if they have resources to do so. If 'azimuth' is specified and the output device cannot produce sounds behind the listening position, values in the rearwards hemisphere should be converted into forwards hemisphere values. One method is as follows:

if 90deg < x <= 180deg then x := 180deg - x
if 180deg < x <= 270deg then x := 540deg - x
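
The two rules above translate directly into Python. In this sketch the function name fold_to_front is hypothetical, and reducing the input modulo 360 first is an added step so that negative angles are handled as well.

```python
def fold_to_front(x):
    """Mirror a rear-hemisphere azimuth (degrees) into the front hemisphere."""
    x = x % 360  # assumed pre-step: bring the angle into 0 <= x < 360
    if 90 < x <= 180:
        return 180 - x   # rear right mirrors to front right
    if 180 < x <= 270:
        return 540 - x   # rear left mirrors to front left
    return x             # already in the front hemisphere

print(fold_to_front(120))  # 60
print(fold_to_front(240))  # 300
print(fold_to_front(180))  # 0
```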

4.2 'elevation'

The value is given in degrees in the range -90deg to 90deg. 0deg is interpreted as on the forward horizon, which loosely means level with the listener. 90deg is directly overhead and -90deg is directly underneath. The precise means used to achieve this effect and the number of speakers used to do so are undefined. This property merely identifies the desired end result. UAs should attempt to honor this request if they have resources to do so.

The relative keywords higher and lower add and subtract 10 degrees from the elevation, respectively. Examples:

        
  H1   { elevation: above }   
  TR.a { elevation: 60deg }
  TR.b { elevation: 30deg }
  TR.c { elevation: level }
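
The relative keywords can be sketched in Python as follows. The function name shift_elevation is hypothetical. Clamping the result to the range -90deg to 90deg is an assumption: this draft gives the legal range but does not say what happens when a relative step would leave it.

```python
def shift_elevation(current, keyword):
    """Apply 'higher' or 'lower' to an elevation given in degrees."""
    step = {"higher": 10, "lower": -10}[keyword]
    # Assumed behaviour: keep the result within -90..90 degrees.
    return max(-90, min(90, current + step))

print(shift_elevation(30, "higher"))  # 40
print(shift_elevation(0, "lower"))    # -10
print(shift_elevation(85, "higher"))  # 90 (clamped)
```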

5 Speech properties

5.1 'speech-rate'

Specifies the speaking rate. Note that both absolute and relative keyword values are allowed (compare with 'font-weight'). If a numerical value is given, it refers to words per minute, a quantity which varies somewhat by language but is nevertheless widely supported by speech synthesizers. The value 'medium' refers to the reader's preferred speech-rate setting. Relative values are more readily cascadable.

5.2 'voice-family'

Value: [[<specific-voice> | <generic-voice>],]* [<specific-voice> | <generic-voice>]
Initial: UA Specific
Applies to: all elements
Inherited: yes
Percentage values: N/A

The value is a prioritized list of voice family names (compare with 'font-family'). Suggested generic families: male, female, child.

Examples of specific voice families are: comedian, trinoids, carlos, lisa

Examples

  H1 { voice-family: announcer, male }
  P.part.romeo {  voice-family: romeo, male }
  P.part.juliet { voice-family: juliet, female }
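
Selection from the prioritized list might work like font matching, as in the Python sketch below. The function name pick_voice and the idea of a set of installed voice names are assumptions; how a UA discovers the voices of its speech synthesizer is outside the scope of this draft.

```python
def pick_voice(voice_family, installed):
    """Return the first voice in the prioritized list that is available.

    voice_family -- the property value, e.g. 'romeo, male'
    installed    -- set of voice names the synthesizer offers (hypothetical)
    Returns None if no listed voice is available, in which case the UA
    would fall back to its own default.
    """
    for name in voice_family.split(","):
        name = name.strip()
        if name in installed:
            return name
    return None

installed = {"male", "female", "announcer"}
print(pick_voice("announcer, male", installed))  # announcer
print(pick_voice("romeo, male", installed))      # male
```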

Should the properties of these family names be described, using an @-rule, to allow better client-side matching (like fonts)? If so, what are the values that describe these voice families in a way that is independent of speech synthesizer?

5.3 'pitch' (or 'average pitch' ?, what is the relationship with voice-family?)

Specifies the average pitch of the speaking voice in hertz (Hz).

5.4 'pitch-range' (could be combined with 'pitch' ?) or 'inflection'

Value: <percentage>
Initial: 50%
Applies to: all elements
Inherited: yes
Percentage values: relative to..

Specifies variation in average pitch. A pitch range of 0% produces a flat, monotonic voice. A pitch range of 50% produces normal inflection. Pitch ranges greater than 50% produce animated voices.

5.5 'stress'

Value: <percentage>
Initial: medium
Applies to: all elements
Inherited: yes
Percentage values: Relative to...

Specifies the level of stress (assertiveness or emphasis) of the speaking voice. English is a stressed language, and different parts of a sentence are assigned primary, secondary or tertiary stress. The value of property 'stress' controls the amount of inflection that results from these stress markers.

Increasing the value of this property results in the speech being more strongly inflected. It is in a sense dual to the 'pitch-range' property and is provided to allow developers to exploit higher-end auditory displays.

Combine 'pitch-range' and 'stress' into one property 'inflection'?

5.6 'richness' ('brightness' ?)

Value: <percentage>
Initial: 50%
Applies to: all elements
Inherited: yes
Percentage values: Relative to...

Specifies the richness (brightness) of the speaking voice. The effect of increasing richness is to produce a voice that carries; reducing richness produces a soft, mellifluous voice.

The following four properties are very preliminary; discussion is invited:

5.7 'speak-punctuation'

Value: code | none
Initial: none
Applies to: all elements
Inherited: yes
Percentage values: N/A

'code' indicates that punctuation such as semicolons, braces, and so on are to be spoken literally. The default value of 'none' means that punctuation is not spoken but instead is rendered naturally as various pauses.

5.8 'speak-date'

Value: mdy | dmy | ymd
Initial: UA Specific
Applies to: all elements
Inherited: yes
Percentage values: N/A

This is a request about how any dates should be spoken. month-day-year is common in the USA, while day-month-year is common in Europe and year-month-day is also used.

This would be most useful when combined with a new HTML tag used to identify dates, such as this theoretical example:

   <p>The campaign started on <date value="1874-oct-21">
   the twenty-first of that month</date> and finished 
   <date value="1874-oct-28">a week later</date>
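
Given a machine-readable date such as the value attribute in the theoretical markup above, a UA could reorder the components before speaking them. The Python sketch below illustrates this. The function name spoken_order, the English month names, and the keyword spellings mdy, dmy and ymd for the three orders named in the text are assumptions made for the example.

```python
import datetime

MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]

# Order of the components for each keyword (assumed spellings).
ORDERS = {
    "mdy": ("month", "day", "year"),   # common in the USA
    "dmy": ("day", "month", "year"),   # common in Europe
    "ymd": ("year", "month", "day"),
}

def spoken_order(date, order):
    """Return the date's components as text, in the requested order."""
    parts = {
        "day": str(date.day),
        "month": MONTHS[date.month - 1],
        "year": str(date.year),
    }
    return " ".join(parts[name] for name in ORDERS[order])

start = datetime.date(1874, 10, 21)
print(spoken_order(start, "mdy"))  # October 21 1874
print(spoken_order(start, "dmy"))  # 21 October 1874
print(spoken_order(start, "ymd"))  # 1874 October 21
```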

5.9 'speak-numeral'

Value: digits | continuous | none
Initial: none
Applies to: all elements
Inherited: yes
Percentage values: N/A

5.10 'speak-time'

Value: 24 | 12 | none
Initial: none
Applies to: all elements
Inherited: yes
Percentage values: N/A

6 References

[1] Cascading Style Sheets, level 1

7 Further Reading

A review of the Spatial Audio Literature

The VRML 2.0 Sound node and comments on spatial audio in VRML

Study of information presentation using multiple perceptually distinguishable auditory streams

Spatial audio and other aural styling used to present mathematical documents

Speech Synthesis Markup Language

The Speech Synthesis Museum

The Emacspeak system

Macintosh speech resources and applications

Presenting HTML Structure in Audio: User Satisfaction with Audio Hypertext (good references)

Chris Lilley (editor)
Created: 28-Feb-1996
Last modified: $Date: 1997/06/30 04:18:51 $