19 Aural style sheets

Contents

  1. Aural cascading style sheet properties
    1. Volume properties: 'volume'
    2. Speaking properties: 'speak'
    3. Pause properties: 'pause-before', 'pause-after', and 'pause'
    4. Cue properties: 'cue-before', 'cue-after', and 'cue'
    5. Mixing properties: 'play-during'
    6. Spatial properties: 'azimuth' and 'elevation'
    7. Voice characteristic properties: 'speech-rate', 'voice-family', 'pitch', 'pitch-range', 'stress', 'richness', 'speak-punctuation', 'speak-date', 'speak-numeral', and 'speak-time'

Those of us who are sighted are accustomed to visual presentation of documents, frequently on a bitmapped display. This is not the only possible presentation method, however. Aural presentation, using a combination of speech synthesis and 'audio icons', provides an alternative presentation. This form of presentation is already in current use by the blind and print-impaired communities.

Often such aural presentation occurs by converting the document to plain text and feeding this to a 'screen reader' -- software or hardware that simply reads all the characters on the screen. This results in less effective presentation than would be the case if the document structure were retained. A benefit of separating the content (e.g., the HTML) and the visual presentation (the stylesheet) is that other types of presentation can also be offered as options (other stylesheets). Stylesheet properties for aural presentation can be used together with visual properties (mixed media) or as an aural alternative to visual presentation.

Besides the obvious accessibility issues for the blind, there are other large markets for aural presentation:

in-car use
keep your eyes on the road ahead, Jack, and search the web for recommended hotels in the next town up ahead
industrial and medical documentation systems (intranets)
my hands and eyes are otherwise occupied with your triple bypass but I would still like your medication records
home entertainment
images, headlines, movies are fine on the wide-screen TV but I don't want to read body text off the screen from the couch; speak it to me (perhaps through the 5 speaker home theater set-up)
the illiterate
I understand everything you say, but I don't read very well

Hence, aural or mixed aural/visual presentation is likely to increase in importance over the next few years. Note that the aural rendering is essentially independent of the visual rendering.

19.1 Aural cascading style sheet properties

19.1.1 Volume properties: 'volume'

'volume'

Property name:'volume' 
Value:<number> | <percentage> | silent | x-soft | soft | medium | loud | x-loud
Initial:medium
Applies to:all elements
Inherited:yes
Percentage values:relative to inherited value

The legal range of numerical values is 0 to 100. Note that '0' does not mean the same as 'silent': 0 represents the minimum audible volume level, and 100 corresponds to the maximum comfortable level.

Percentage values are calculated relative to the inherited value, and are then clipped to the range 0 to 100.

There is a fixed mapping between keyword values and volumes:

Volume refers to the median volume of the waveform. In other words, a highly inflected voice at a volume of 50 might peak well above that. The overall values are likely to be human adjustable for comfort, for example with a physical volume control (which would increase both the 0 and 100 values proportionately); what this property does is adjust the dynamic range.

The UA should allow the values corresponding to 0 and 100 to be set by the listener. No one setting is universally applicable; suitable values depend on the equipment in use (speakers, headphones), the environment (in car, home theater, library) and personal preferences. Some examples:

The same author's stylesheet could be used in all cases, simply by mapping the 0 and 100 points suitably at the client side.
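
As a rough illustration of the numeric and percentage forms described in this section, the following sketch shows how a percentage might be resolved against the inherited value and then clipped. The class names are hypothetical, and the comments assume that EM and STRONG each inherit the value 60 directly from BODY.

```css
BODY     { volume: 60 }      /* absolute value within the legal range 0 to 100 */
EM       { volume: 150% }    /* 150% of the inherited 60 = 90 */
STRONG   { volume: 200% }    /* 200% of the inherited 60 = 120, clipped to 100 */
P.aside  { volume: soft }    /* keyword value */
P.hidden { volume: silent }  /* no sound at all, which is not the same as volume: 0 */
```

If STRONG were nested inside EM instead, it would inherit 90 rather than 60, so the percentage would be resolved against 90 before clipping.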

19.1.2 Speaking properties: 'speak'

'speak'

Property name:'speak' 
Value:normal | none | spell-out
Initial:normal
Applies to:all elements
Inherited:yes
Percentage values:N/A

This property specifies whether text will be rendered aurally and, if so, in what manner (somewhat analogous to the 'display' property). The possible values are:

none
Suppresses aural rendering so that, unless overridden recursively, the element and its children require no time to render.
normal
Uses regular language-dependent pronunciation rules for rendering an element and its children.
spell-out
Spells the text one letter at a time (useful for acronyms and abbreviations).

Note the difference between an element whose 'volume' property has a value of 'silent' and an element whose 'speak' property has the value 'none':

The former takes up the same time as if it had been spoken, including any pause before and after the element, but no sound is generated. This may be used in language teaching applications, for example. A pause is generated for the pupil to speak the element themselves. Note that since the value of this property is inherited, child elements will also be silent. Child elements may however set the volume to a non-silent value and will then be spoken.

Elements whose 'speak' property has the value 'none' are not spoken and take no time. Child elements may however override this value and may be spoken normally.
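
The contrast between 'volume: silent' and 'speak: none' can be sketched as follows. The class names are hypothetical and only serve this example; the comments restate the behavior described above.

```css
P.answer             { volume: silent }    /* takes the normal speaking time, but no sound is generated */
P.answer EM          { volume: medium }    /* child overrides the inherited 'silent' and is spoken */
DIV.visual           { speak: none }       /* not spoken, takes no time */
DIV.visual P.caption { speak: normal }     /* child overrides 'none' and is spoken normally */
ACRONYM              { speak: spell-out }  /* spelled one letter at a time */
```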

19.1.3 Pause properties: 'pause-before', 'pause-after', and 'pause'

'pause-before'

Property name:'pause-before' 
Value:<time> | <percentage>
Initial:depends on user-agent
Applies to:all elements
Inherited:no
Percentage values:see description below

The 'pause-before' property specifies the pause before an element is spoken. It may be given in absolute units (seconds, milliseconds) or as a relative value, in which case it is relative to the reciprocal of the 'speech-rate' property. If 'speech-rate' is 120 words per minute (i.e., a word takes half a second, or 500 milliseconds), then a 'pause-before' of 100% means a pause of 500ms and a 'pause-before' of 20% means 100ms.

Using relative units gives more robust stylesheets in the face of large changes in speech-rate and is recommended practice.
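
The arithmetic above can be sketched in a stylesheet as follows. The class names are hypothetical, and the comments assume every P element speaks at the 120 words per minute set in the first rule.

```css
P       { speech-rate: 120 }     /* 120 words per minute, so one word takes 500ms */
P       { pause-before: 100% }   /* 100% of 500ms = 500ms */
P.tight { pause-before: 20% }    /* 20% of 500ms = 100ms */
P.fixed { pause-before: 500ms }  /* absolute value: stays 500ms whatever the speech rate */
```

If the listener doubled the rate to 240 words per minute, a word would take 250ms, so the 100% pause would shrink to 250ms while the absolute 500ms pause would not change. This is the robustness that relative units are meant to provide.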

'pause-after'

Property name:'pause-after' 
Value:<time> | <percentage>
Initial:depends on user-agent
Applies to:all elements
Inherited:no
Percentage values:see description below

This property specifies the pause after an element is spoken. Values are specified the same way as 'pause-before'.

'pause'

Property name:'pause' 
Value:[<time> | <percentage>]{1,2}
Initial:depends on user-agent
Applies to:all elements
Inherited:no
Percentage values:see descriptions of 'pause-before' and 'pause-after'

The 'pause' property is a shorthand for setting 'pause-before' and 'pause-after'. If two values are given, the first value is 'pause-before' and the second is 'pause-after'. If only one value is given, it applies to both properties.

Examples:

  H1 { pause: 20ms } /* pause-before: 20ms; pause-after: 20ms */
  H2 { pause: 30ms 40ms } /* pause-before: 30ms; pause-after: 40ms */
  H3 { pause-after: 10ms } /* pause-before: ?; pause-after: 10ms */

19.1.4 Cue properties: 'cue-before', 'cue-after', and 'cue'

'cue-before'

Property name:'cue-before' 
Value:<url> | none
Initial:none
Applies to:all elements
Inherited:no
Percentage values:N/A

'cue-after'

Property name:'cue-after' 
Value:<url> | none
Initial:none
Applies to:all elements
Inherited:no
Percentage values:N/A

Auditory icons are another way to distinguish semantic elements. Sounds may be played before and/or after the element to delimit it.

For example:

  A {cue-before: url(bell.aiff); cue-after: url(dong.wav) }
  H1 {cue-before: url(pop.au); cue-after: url(pop.au) }

'cue'

Property name:'cue' 
Value:<'cue-before'> || <'cue-after'>
Initial:not defined for shorthand properties
Applies to:all elements
Inherited:no
Percentage values:N/A

The same sound can be used both before and after, using the shorthand 'cue' property.

The following two rules are equivalent:

  H1 {cue-before: url(pop.au); cue-after: url(pop.au) }
  H1 {cue: url(pop.au) }
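
The value grammar of 'cue' also permits two values. The text above does not say which value maps to which property, so the following sketch rests on an assumption, made by analogy with the 'pause' shorthand: the first value sets 'cue-before' and the second sets 'cue-after'.

```css
/* Assumption: the first value is 'cue-before', the second is 'cue-after'. */
A { cue: url(bell.aiff) url(dong.wav) }

/* Under that assumption, the rule above is equivalent to the longhand form: */
A { cue-before: url(bell.aiff); cue-after: url(dong.wav) }
```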

19.1.5 Mixing properties: 'play-during'

'play-during'

Property name:'play-during' 
Value:<url> mix? repeat? | auto | none
Initial:auto
Applies to:all elements
Inherited:no
Percentage values:N/A

Similar to the 'cue-before' and 'cue-after' properties, this property specifies a sound to be played during an element as a background (i.e., the sound is mixed in with the speech).

The optional 'mix' keyword means the sound inherited from the parent element's play-during property continues to play and the current element sound (pointed to by the URL) is mixed with it. If 'mix' is not specified, the sound replaces the sound of the parent element.

The optional 'repeat' keyword means the sound will repeat if it is too short to fill the entire duration of the element. Without this keyword, the sound plays once and then stops. This is similar to the background repeat properties in CSS2. If the sound is too long for the element, it is clipped once the element is spoken.

The value 'auto' means that the sound of the parent element continues to play (it is not restarted, which would have been the case if this property were inherited). The value 'none' means that there is silence: the sound of the parent element (if any) is silent for the current element and continues after the current element.

Examples:

  BLOCKQUOTE.sad {play-during: url(violins.aiff) }
  BLOCKQUOTE Q {play-during: url(harp.wav) mix}
  SPAN.quiet {play-during: none }
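
A further sketch, combining the 'mix' and 'repeat' keywords with 'auto' and 'none'. The class names and sound file names are hypothetical, and the comments assume each DIV and P is a descendant of BODY.

```css
BODY        { play-during: url(rain.aiff) repeat }     /* repeats to fill the whole document */
DIV.storm   { play-during: url(thunder.wav) mix }      /* plays once, mixed with the rain */
DIV.dream   { play-during: url(harp.wav) mix repeat }  /* repeats, mixed with the rain */
DIV.radio   { play-during: url(jingle.au) }            /* no 'mix': replaces the rain, plays once */
P.narration { play-during: auto }                      /* parent's sound continues, not restarted */
P.aside     { play-during: none }                      /* parent's sound is silent here, continues afterwards */
```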

If a stereo icon is dereferenced, the central point of the stereo pair should be placed at the azimuth for that element, and the left and right channels should be placed to either side of this position.

19.1.6 Spatial properties: 'azimuth' and 'elevation'

Spatial audio is an important stylistic property for aural presentation. It provides a natural way to tell several voices apart, just as we do in real life (people rarely all stand in the same spot in a room). Stereo speakers produce a lateral sound stage. Binaural headphones or the increasingly popular 5-speaker home theater setups can generate full surround sound, and multi-speaker setups can create a true three-dimensional sound stage. VRML 2.0 also includes spatial audio (and uses the same azimuth and elevation terms, which originate in astronomy), which implies that in time consumer-priced spatial audio hardware will become more widely available.

'azimuth'

Property name:'azimuth' 
Value:<angle> | [[ left-side | far-left | left | center-left | center | center-right | right | far-right | right-side ] || behind ] | leftwards | rightwards
Initial:center
Applies to:all elements
Inherited:yes
Percentage values:N/A

The value is given in the range -360deg <= x < 360deg where 0deg is interpreted as directly ahead in the center of the sound stage. 90deg is to the right, 180deg behind and 270deg (or, equivalently and more conveniently, -90deg) to the left. It may also be specified using absolute keywords:

  keyword        value     value with 'behind'
  left-side      270deg    270deg
  far-left       300deg    240deg
  left           320deg    220deg
  center-left    340deg    200deg
  center         0deg      180deg
  center-right   20deg     160deg
  right          40deg     140deg
  far-right      60deg     120deg
  right-side     90deg     90deg

The azimuth may also be specified using relative keywords. The value 'leftwards' moves the sound more to the left (subtracts 20 degrees), while the value 'rightwards' moves the sound more to the right (adds 20 degrees). Arithmetic is carried out modulo 360 degrees.
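
The relative keywords and the modulo arithmetic can be illustrated with a short sketch. The class names are hypothetical, and the comments assume that each relative value is applied to the azimuth inherited from the parent rule shown just above it.

```css
BODY       { azimuth: center }       /*   0deg */
DIV.left   { azimuth: leftwards }    /*   0deg - 20deg = -20deg, i.e. 340deg (center-left) */
DIV.left P { azimuth: leftwards }    /* 340deg - 20deg = 320deg (left) */
DIV.back   { azimuth: behind left }  /* 220deg */
DIV.back P { azimuth: rightwards }   /* 220deg + 20deg = 240deg (far-left with 'behind') */
```

Note that in the last rule, adding 20 degrees to a position behind the listener yields 240deg, which the keyword table lists as far-left with 'behind'. The relative keywords always add or subtract the same angle regardless of the hemisphere.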

This property is most likely to be implemented by mixing the same signal into different channels at differing volumes. It might also use phase shifting, digital delay, and other such techniques to provide the illusion of a sound stage. The precise means used to achieve this effect and the number of speakers used to do so are browser dependent - this property merely identifies the desired end result.

Examples:

  H1   { azimuth: 30deg }          
  TD.a { azimuth: far-right }          /*  60deg */
  #12  { azimuth: behind far-right }   /* 120deg */
  P.comment { azimuth: behind }        /* 180deg */

UAs should attempt to honor this request if they have resources to do so. If 'azimuth' is specified and the output device cannot produce sounds behind the listening position, values in the rearwards hemisphere should be converted into forwards hemisphere values. One method is as follows:

'elevation'

Property name:'elevation' 
Value:<angle> | below | level | above | higher | lower
Initial:level
Applies to:all elements
Inherited:yes
Percentage values:N/A

The value is given in degrees in the range -90deg to 90deg. 0deg is interpreted as on the forward horizon, which loosely means level with the listener. 90deg is directly overhead and -90deg is directly underneath. The precise means used to achieve this effect and the number of speakers used to do so are undefined. This property merely identifies the desired end result. UAs should attempt to honor this request if they have resources to do so.

The relative keywords higher and lower add and subtract 10 degrees from the elevation, respectively.

Examples:

        
  H1   { elevation: above }   
  TR.a { elevation: 60deg }
  TR.b { elevation: 30deg }
  TR.c { elevation: level } 
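
The relative keywords can be sketched in the same way. The class names are hypothetical, and the comments assume that 'level' corresponds to 0deg and that each relative value is applied to the elevation inherited from the enclosing row.

```css
TR.a         { elevation: 60deg }
TR.a TD.note { elevation: higher }  /* 60deg + 10deg = 70deg */
TR.b         { elevation: level }   /* assumed to be 0deg */
TR.b TD.note { elevation: lower }   /* 0deg - 10deg = -10deg */
```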

19.1.7 Voice characteristic properties: 'speech-rate', 'voice-family', 'pitch', 'pitch-range', 'stress', 'richness', 'speak-punctuation', 'speak-date', 'speak-numeral', and 'speak-time'

'speech-rate'

Property name:'speech-rate' 
Value:<number> | x-slow | slow | medium | fast | x-fast | faster | slower
Initial:medium
Applies to:all elements
Inherited:yes
Percentage values:N/A

Specifies the speaking rate. Note that both absolute and relative keyword values are allowed (compare with 'font-weight'). If a numerical value is given, it refers to words per minute, a quantity which varies somewhat by language but is nevertheless widely supported by speech synthesizers. The value 'medium' refers to the reader's preferred speech-rate setting. Relative values may be cascaded more readily.
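
A brief sketch of the three kinds of value, with hypothetical class names:

```css
BODY       { speech-rate: medium }  /* the listener's preferred rate */
P.legal    { speech-rate: 200 }     /* numerical value: 200 words per minute */
P.summary  { speech-rate: fast }    /* absolute keyword */
BLOCKQUOTE { speech-rate: slower }  /* relative keyword: slower than the inherited rate */
```

The relative rule for BLOCKQUOTE keeps working if the listener changes the base rate, which is one reason relative values cascade more readily.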

'voice-family'

Property name:'voice-family' 
Value:[[<specific-voice> | <generic-voice> ],]* [<specific-voice> | <generic-voice> ]
Initial:depends on user agent
Applies to:all elements
Inherited:yes
Percentage values:N/A

The value is a prioritized list of voice family names (compare with 'font-family'). Suggested generic families: male, female, child.

Examples of <specific-voice> families are: comedian, trinoids, carlos, lisa.

Examples:

  H1 { voice-family: announcer, male }
  P.part.romeo  { voice-family: romeo, male }
  P.part.juliet { voice-family: juliet, female }

'pitch'

Property name:'pitch' 
Value:<frequency> | x-low | low | medium | high | x-high
Initial:medium
Applies to:all elements
Inherited:yes
Percentage values:N/A

Specifies the average pitch of the speaking voice in hertz (Hz).

'pitch-range'

Property name:'pitch-range' 
Value:<number>
Initial:50
Applies to:all elements
Inherited:yes
Percentage values:N/A

Specifies variation in average pitch. A pitch range of 0 produces a flat, monotonic voice. A pitch range of 50 produces normal inflection. Pitch ranges greater than 50 produce animated voices.

'stress'

Property name:'stress' 
Value:<number>
Initial:50
Applies to:all elements
Inherited:yes
Percentage values:N/A

Specifies the level of stress (assertiveness or emphasis) of the speaking voice. English is a stressed language, and different parts of a sentence are assigned primary, secondary or tertiary stress. The value of 'stress' controls the amount of inflection that results from these stress markers.

Increasing the value of this property results in the speech being more strongly inflected. It is in a sense dual to the 'pitch-range' property and is provided to allow developers to exploit higher-end auditory displays.

'richness'

Property name:'richness' 
Value:<number>
Initial:50
Applies to:all elements
Inherited:yes
Percentage values:N/A

Specifies the richness (brightness) of the speaking voice. The effect of increasing richness is to produce a voice that carries; reducing richness produces a soft, mellifluous voice.
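
The following sketch combines 'pitch', 'pitch-range', 'stress', and 'richness' to shape a few hypothetical voices. The class names and the particular numbers are illustrative only; how they actually sound depends on the speech synthesizer.

```css
P.narrator { pitch: low;   pitch-range: 30; richness: 30 }  /* lower, calmer, softer voice */
P.excited  { pitch: high;  pitch-range: 80; stress: 70 }    /* animated, strongly inflected voice */
P.announce { pitch: medium; richness: 80 }                  /* higher richness: a voice that carries */
P.robot    { pitch: 120Hz; pitch-range: 0 }                 /* flat, monotonic voice at an average of 120Hz */
```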

The following four properties are very preliminary; discussion is invited:

'speak-punctuation'

Property name:'speak-punctuation' 
Value:code | none
Initial:none
Applies to:all elements
Inherited:yes
Percentage values:N/A

A value of 'code' indicates that punctuation such as semicolons, braces, and so on is to be spoken literally. The default value of 'none' means that punctuation is not spoken but instead is rendered naturally as various pauses.
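
For instance, a stylesheet might speak punctuation literally only inside program listings. This is a minimal sketch; the choice of PRE and CODE is an assumption made for illustration.

```css
PRE, CODE { speak-punctuation: code }  /* semicolons, braces, and so on are spoken literally */
P         { speak-punctuation: none }  /* punctuation is rendered as pauses */
```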

'speak-date'

Property name:'speak-date' 
Value:mdy | dmy | ymd
Initial:depends on user agent
Applies to:all elements
Inherited:yes
Percentage values:N/A

This property controls how dates should be spoken. Month-day-year order is common in the USA, day-month-year is common in Europe, and year-month-day is also used.

This would be useful, for example, when combined with an XML element used to identify dates, such as:

   <para>The campaign started on <date value="1874-10-21"/>
    and finished <date value="1874-10-28"/></para>
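
A rule for the date element in the XML fragment above might look like the following sketch. How a user agent obtains the date from the 'value' attribute is not specified here and is left as an assumption.

```css
date { speak-date: dmy }  /* request day-month-year order when the date is spoken */
```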

'speak-numeral'

Property name:'speak-numeral' 
Value:digits | continuous | none
Initial:none
Applies to:all elements
Inherited:yes
Percentage values:N/A

This property controls whether multi-digit numerals (such as 237) are spoken as a single number (two hundred and thirty seven) or individual digits (two three seven).
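
A short sketch with hypothetical class names, using the numeral 237 from the text:

```css
SPAN.phone { speak-numeral: digits }      /* 237 is spoken as "two three seven" */
SPAN.count { speak-numeral: continuous }  /* 237 is spoken as "two hundred and thirty seven" */
```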

'speak-time'

Property name:'speak-time' 
Value:24 | 12 | none
Initial:none
Applies to:all elements
Inherited:yes
Percentage values:N/A

This property controls whether times are spoken in the 24-hour time system or the 12-hour, am/pm system. When used in combination with the 'speak-date' property, this allows elements with an attribute containing an ISO 8601 format date/time to be presented in a flexible manner.

An additional aural property, 'speak-header-cell', is described in the chapter on tables.