19 Aural style sheets

19.1 Introduction to aural style sheets

The aural rendering of a document, already commonly used by the blind and print-impaired communities, combines speech synthesis and "audio icons." Often such aural presentation occurs by converting the document to plain text and feeding this to a screen reader -- software or hardware that simply reads all the characters on the screen. This results in less effective presentation than would be the case if the document structure were retained. Style sheet properties for aural presentation may be used together with visual properties (mixed media) or as an aural alternative to visual presentation.

Besides the obvious accessibility advantages, there are other large markets for aural presentation, including in-car use, industrial and medical documentation systems (intranets), home entertainment, and support for users who cannot read.

19.2 Volume properties: 'volume'

'volume'

Property name:	'volume'
Value:	<number> \| <percentage> \| silent \| x-soft \| soft \| medium \| loud \| x-loud \| inherit
Initial:	medium
Applies to:	all elements
Inherited:	yes
Percentage values:	relative to inherited value
Media groups:	aural

Volume refers to the median volume of the waveform. In other words, a highly inflected voice at a volume of 50 might peak well above that. The overall values are likely to be human adjustable for comfort, for example with a physical volume control (which would increase both the 0 and 100 values proportionately); what this property does is adjust the dynamic range.

Values have the following meanings:

<number>: Any number between '0' and '100'. '0' represents the minimum audible volume level and '100' corresponds to the maximum comfortable level.
<percentage>: Percentage values are calculated relative to the inherited value, and are then clipped to the range '0' to '100'.
silent: No sound at all. Note. The value '0' does not mean the same as 'silent'.
x-soft: Same as '0'.
soft: Same as '25'.
medium: Same as '50'.
loud: Same as '75'.
x-loud: Same as '100'.
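
As an illustration only, the following rules sketch how the numeric, keyword, and percentage forms interact. The class names are hypothetical, and the comments assume that each P element is a child of BODY:

  BODY    { volume: medium }  /* same as 50 */
  EM      { volume: loud }    /* same as 75 */
  P.aside { volume: 50% }     /* 50% of the inherited 50 gives 25 */
  P.shout { volume: 300% }    /* 300% of the inherited 50 gives 150, clipped to 100 */
  P.mute  { volume: silent }  /* no sound at all, unlike '0' */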

User agents should allow the values corresponding to '0' and '100' to be set by the listener. No one setting is universally applicable; suitable values depend on the equipment in use (speakers, headphones), the environment (in car, home theater, library) and personal preferences. Some examples:

A browser for in-car use has a setting for when there is lots of background noise. '0' would map to a fairly high level and '100' to a quite high level. The speech is easily audible over the road noise but the overall dynamic range is compressed. Cars with better insulation might allow a wider dynamic range.
Another speech browser is being used in an apartment, late at night, or in a shared study room. '0' is set to a very quiet level and '100' to a fairly quiet level, too. As with the first example, there is a low slope; the dynamic range is reduced. The actual volumes are low here, whereas they were high in the first example.
A third speech browser is used in a quiet and isolated house with an expensive hi-fi home theater setup. '0' is set fairly low and '100' to quite high; there is wide dynamic range.

The same author style sheet could be used in all cases, simply by mapping the '0' and '100' points suitably at the client side.

19.3 Speaking properties: 'speak'

'speak'

Property name:	'speak'
Value:	normal \| none \| spell-out \| inherit
Initial:	normal
Applies to:	all elements
Inherited:	yes
Percentage values:	N/A
Media groups:	aural

This property specifies whether text will be rendered aurally and if so, in what manner (somewhat analogous to the 'display' property). The possible values are:

none: Suppresses aural rendering so that the element requires no time to render. Child elements may override this value, in which case they are spoken.
normal: Uses regular language-dependent pronunciation rules for rendering an element and its children.
spell-out: Spells the text one letter at a time (useful for acronyms and abbreviations).

Note the difference between an element whose 'volume' property has a value of 'silent' and an element whose 'speak' property has the value 'none'. The former takes up the same time as if it had been spoken, including any pause before and after the element, but no sound is generated. This may be used in language teaching applications, for example. A pause is generated for the pupil to speak the element themselves. Note that since the value of this property is inherited, child elements will also be silent. Child elements may however set the volume to a non-silent value and will then be spoken. On the other hand, elements for which the 'speak' property has the value 'none' are not spoken and take no time. Child elements may however override this value and may be spoken normally.
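
The distinction might be sketched as follows. The class names are hypothetical:

  ABBR              { speak: spell-out }  /* read one letter at a time */
  DIV.navigation    { speak: none }       /* not spoken, takes no time */
  DIV.navigation H2 { speak: normal }     /* overrides 'none' and is spoken */
  SPAN.answer       { volume: silent }    /* takes the same time, but no sound */
  SPAN.answer EM    { volume: medium }    /* non-silent again, so it is spoken */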

19.4 Pause properties: 'pause-before', 'pause-after', and 'pause'

'pause-before'

Property name:	'pause-before'
Value:	<time> \| <percentage> \| inherit
Initial:	depends on user agent
Applies to:	all elements
Inherited:	no
Percentage values:	see prose
Media groups:	aural

'pause-after'

Property name:	'pause-after'
Value:	<time> \| <percentage> \| inherit
Initial:	depends on user agent
Applies to:	all elements
Inherited:	no
Percentage values:	see prose
Media groups:	aural

These properties specify a pause to be observed before (or after) speaking an element's content. Values have the following meanings:

<time>: Expresses the pause in absolute time units (seconds and milliseconds).
<percentage>: Refers to the inverse of the value of the 'speech-rate' property. For example, if the speech-rate is 120 words per minute (i.e., a word takes half a second, or 500ms), then a 'pause-before' of 100% means a pause of 500ms and a 'pause-before' of 20% means a pause of 100ms.

Authors should use relative units to create more robust style sheets in the face of large changes in speech-rate.
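
For instance, the following sketch (class names are hypothetical) shows how a percentage pause scales with the speech rate while an absolute pause does not:

  P       { speech-rate: 120; pause-before: 20% }    /* a word takes 500ms, so 100ms */
  P.brisk { speech-rate: 240; pause-before: 20% }    /* a word takes 250ms, so 50ms */
  P.fixed { speech-rate: 240; pause-before: 100ms }  /* 100ms at any speech-rate */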

'pause'

Property name:	'pause'
Value:	[ [<time> \| <percentage>]{1,2} ] \| inherit
Initial:	depends on user agent
Applies to:	all elements
Inherited:	no
Percentage values:	see descriptions of 'pause-before' and 'pause-after'
Media groups:	aural

The 'pause' property is a shorthand for setting 'pause-before' and 'pause-after'. If two values are given, the first value is 'pause-before' and the second is 'pause-after'. If only one value is given, it applies to both properties.

Examples:

  H1 { pause: 20ms } /* pause-before: 20ms; pause-after: 20ms */
  H2 { pause: 30ms 40ms } /* pause-before: 30ms; pause-after: 40ms */
  H3 { pause-after: 10ms } /* pause-before: ?; pause-after: 10ms */

19.5 Cue properties: 'cue-before', 'cue-after', and 'cue'

'cue-before'

Property name:	'cue-before'
Value:	<uri> \| none \| inherit
Initial:	none
Applies to:	all elements
Inherited:	no
Percentage values:	N/A
Media groups:	aural

'cue-after'

Property name:	'cue-after'
Value:	<uri> \| none \| inherit
Initial:	none
Applies to:	all elements
Inherited:	no
Percentage values:	N/A
Media groups:	aural

Auditory icons are another way to distinguish semantic elements. Sounds may be played before and/or after the element to delimit it. Values have the following meanings:

<uri>: The URI designates an audio icon resource.
none: No audio icon is specified.

For example:

  A {cue-before: url(bell.aiff); cue-after: url(dong.wav) }
  H1 {cue-before: url(pop.au); cue-after: url(pop.au) }

'cue'

Property name:	'cue'
Value:	[ <'cue-before'> \|\| <'cue-after'> ] \| inherit
Initial:	not defined for shorthand properties
Applies to:	all elements
Inherited:	no
Percentage values:	N/A
Media groups:	aural

The 'cue' property is a shorthand for setting 'cue-before' and 'cue-after'. If two values are given, the first value is 'cue-before' and the second is 'cue-after'. If only one value is given, it applies to both properties.

The following two rules are equivalent:

  H1 {cue-before: url(pop.au); cue-after: url(pop.au) }
  H1 {cue: url(pop.au) }
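
When two values are given, the shorthand expands in the same way. In this sketch the sound file names are hypothetical; the following two rules are also equivalent:

  LI {cue-before: url(tick.au); cue-after: url(tock.au) }
  LI {cue: url(tick.au) url(tock.au) }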

19.6 Mixing properties: 'play-during'

'play-during'

Property name:	'play-during'
Value:	<uri> mix? repeat? \| auto \| none \| inherit
Initial:	auto
Applies to:	all elements
Inherited:	no
Percentage values:	N/A
Media groups:	aural

Similar to the 'cue-before' and 'cue-after' properties, this property specifies a sound to be played as a background while an element's content is spoken. Values have the following meanings:

<uri>: The sound designated by this <uri> is played as a background while the element's content is spoken.
mix: When present, this keyword means that the sound inherited from the parent element's 'play-during' property continues to play and the sound designated by the <uri> is mixed with it. If 'mix' is not specified, the sound replaces the sound of the parent element.
repeat: When present, this keyword means that the sound will repeat if it is too short to fill the entire duration of the element. Otherwise, the sound plays once and then stops. This is similar to the 'background-repeat' property in CSS2. If the sound is too long for the element, it is clipped once the element has been spoken.
auto: The sound of the parent element continues to play (it is not restarted, which would have been the case if this property had been inherited).
none: Means that there is silence: the sound of the parent element (if any) is silent during the current element and continues after the current element.

Examples:

  BLOCKQUOTE.sad {play-during: url(violins.aiff) }
  BLOCKQUOTE Q {play-during: url(harp.wav) mix}
  SPAN.quiet {play-during: none }
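
A further sketch, with hypothetical class and file names, shows the 'repeat' keyword and the effect of 'auto' on a child element:

  DIV.storm           { play-during: url(rain.wav) repeat }  /* loops until the DIV has been spoken */
  DIV.storm P         { play-during: auto }                  /* the rain continues, not restarted */
  DIV.storm P.thunder { play-during: url(thunder.wav) mix }  /* played once, mixed over the rain */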

If a stereo icon is dereferenced, the central point of the stereo pair should be placed at the azimuth for that element and the left and right channels should be placed to either side of this position.

19.7 Spatial properties: 'azimuth' and 'elevation'

Spatial audio is an important stylistic property for aural presentation. It provides a natural way to tell several voices apart, as in real life (people rarely all stand in the same spot in a room). Stereo speakers produce a lateral sound stage. Binaural headphones or the increasingly popular 5-speaker home theater setups can generate full surround sound, and multi-speaker setups can create a true three-dimensional sound stage. VRML 2.0 also includes spatial audio, which implies that in time consumer-priced spatial audio hardware will become more widely available.

'azimuth'

Property name:	'azimuth'
Value:	<angle> \| [[ left-side \| far-left \| left \| center-left \| center \| center-right \| right \| far-right \| right-side ] \|\| behind ] \| leftwards \| rightwards \| inherit
Initial:	center
Applies to:	all elements
Inherited:	yes
Percentage values:	N/A
Media groups:	aural

Values have the following meanings:

<angle>: Position is described in terms of degrees, within the range '-360deg' to '360deg'. The value '0deg' means directly ahead in the center of the sound stage. '90deg' is to the right, '180deg' behind, and '270deg' (or, equivalently and more conveniently, '-90deg') to the left.
left-side: Same as '270deg'. With 'behind', '270deg'.
far-left: Same as '300deg'. With 'behind', '240deg'.
left: Same as '320deg'. With 'behind', '220deg'.
center-left: Same as '340deg'. With 'behind', '200deg'.
center: Same as '0deg'. With 'behind', '180deg'.
center-right: Same as '20deg'. With 'behind', '160deg'.
right: Same as '40deg'. With 'behind', '140deg'.
far-right: Same as '60deg'. With 'behind', '120deg'.
right-side: Same as '90deg'. With 'behind', '90deg'.
leftwards: Moves the sound to the left, relative to the current angle. More precisely, subtracts 20 degrees. Arithmetic is carried out modulo 360 degrees. Note that 'leftwards' is more accurately described as "turned counter-clockwise," since it always subtracts 20 degrees, even if the inherited azimuth is already behind the listener (in which case the sound actually appears to move to the right).
rightwards: Moves the sound to the right, relative to the current angle. More precisely, adds 20 degrees. See 'leftwards' for arithmetic.

This property is most likely to be implemented by mixing the same signal into different channels at differing volumes. It might also use phase shifting, digital delay, and other such techniques to provide the illusion of a sound stage. The precise means used to achieve this effect and the number of speakers used to do so are user agent-dependent; this property merely identifies the desired end result.

Examples:

  H1   { azimuth: 30deg }          
  TD.a { azimuth: far-right }          /*  60deg */
  #12  { azimuth: behind far-right }   /* 120deg */
  P.comment { azimuth: behind }        /* 180deg */
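
The relative keywords can be illustrated in the same way. In this sketch the class names are hypothetical, and the comments assume that 'leftwards' and 'rightwards' are applied to the inherited azimuth:

  BODY       { azimuth: center }       /*   0deg */
  BLOCKQUOTE { azimuth: rightwards }   /*   0deg + 20deg =  20deg, as a child of BODY */
  P.note     { azimuth: leftwards }    /*   0deg - 20deg = 340deg (modulo 360), as a child of BODY */
  DIV.rear   { azimuth: behind left }  /* 220deg */
  DIV.rear P { azimuth: leftwards }    /* 220deg - 20deg = 200deg, which sounds further to the right */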

If 'azimuth' is specified and the output device cannot produce sounds behind the listening position, user agents should convert values in the rearwards hemisphere to forwards hemisphere values. One method is as follows:

if 90deg < x <= 180deg then x := 180deg - x
if 180deg < x <= 270deg then x := 540deg - x
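
Applied to a few sample values (class names are hypothetical), this method gives the following results on a device that cannot place sounds behind the listener:

  P.a { azimuth: behind far-right }  /* 120deg becomes 180deg - 120deg =  60deg */
  P.b { azimuth: behind }            /* 180deg becomes 180deg - 180deg =   0deg */
  P.c { azimuth: behind far-left }   /* 240deg becomes 540deg - 240deg = 300deg */
  P.d { azimuth: left-side }         /* 270deg becomes 540deg - 270deg = 270deg (unchanged) */
  P.e { azimuth: right-side }        /*  90deg is outside both ranges and is unchanged */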

'elevation'

Property name:	'elevation'
Value:	<angle> \| below \| level \| above \| higher \| lower \| inherit
Initial:	level
Applies to:	all elements
Inherited:	yes
Percentage values:	N/A
Media groups:	aural

Values of this property have the following meanings:

<angle>: Specifies the elevation as an angle, between '-90deg' and '90deg'. '0deg' means on the forward horizon, which loosely means level with the listener. '90deg' means directly overhead and '-90deg' means directly below.
below: Same as '-90deg'.
level: Same as '0deg'.
above: Same as '90deg'.
higher: Adds 10 degrees to the current elevation.
lower: Subtracts 10 degrees from the current elevation.

The precise means used to achieve this effect and the number of speakers used to do so are undefined. This property merely identifies the desired end result.

Examples:

        
  H1   { elevation: above }   
  TR.a { elevation: 60deg }
  TR.b { elevation: 30deg }
  TR.c { elevation: level }
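
The relative keywords might be used as in the following sketch. The class names are hypothetical, and the comments assume that 'higher' and 'lower' are applied to the inherited elevation:

  BODY       { elevation: level }   /*   0deg */
  H2         { elevation: higher }  /*   0deg + 10deg =  10deg, as a child of BODY */
  P.footnote { elevation: lower }   /*   0deg - 10deg = -10deg, as a child of BODY */
  DIV.sky    { elevation: 80deg }
  DIV.sky EM { elevation: higher }  /*  80deg + 10deg =  90deg, directly overhead */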

19.8 Voice characteristic properties: 'speech-rate', 'voice-family', 'pitch', 'pitch-range', 'stress', and 'richness'

'speech-rate'

Property name:	'speech-rate'
Value:	<number> \| x-slow \| slow \| medium \| fast \| x-fast \| faster \| slower \| inherit
Initial:	medium
Applies to:	all elements
Inherited:	yes
Percentage values:	N/A
Media groups:	aural

This property specifies the speaking rate. Note that both absolute and relative keyword values are allowed (compare with 'font-weight'). Values have the following meanings:

<number>: Specifies the speaking rate in words per minute, a quantity that varies somewhat by language but is nevertheless widely supported by speech synthesizers.
x-slow: Same as ?
slow: Same as ?
medium: Same as ? Refers to the user's preferred speech-rate setting.
fast: Same as ?
x-fast: Same as ?
faster: Adds ? to the current speech rate.
slower: Subtracts ? from the current speech rate.
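
As a sketch (class names are hypothetical), absolute and relative values might be combined as follows. No numeric equivalents are assumed for the keywords, since none are given above, and 'slower' is assumed to act on the inherited rate:

  BODY       { speech-rate: medium }  /* the user's preferred setting */
  P.legal    { speech-rate: 220 }     /* 220 words per minute */
  PRE        { speech-rate: slow }
  BLOCKQUOTE { speech-rate: slower }  /* relative to the inherited rate */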

'voice-family'

Property name:	'voice-family'
Value:	[[<specific-voice> \| <generic-voice> ],]* [<specific-voice> \| <generic-voice> ] \| inherit
Initial:	depends on user agent
Applies to:	all elements
Inherited:	yes
Percentage values:	N/A
Media groups:	aural

The value is a comma-separated, prioritized list of voice family names (compare with 'font-family'). Values have the following meanings:

<generic-voice>: Values are voice families (e.g., male, female, child).
<specific-voice>: Values are specific instances (e.g., comedian, trinoids, carlos, lani).

Examples:

  H1 { voice-family: announcer, male }
  P.part.romeo  { voice-family: romeo, male }
  P.part.juliet { voice-family: juliet, female }

'pitch'

Property name:	'pitch'
Value:	<frequency> \| x-low \| low \| medium \| high \| x-high \| inherit
Initial:	medium
Applies to:	all elements
Inherited:	yes
Percentage values:	N/A
Media groups:	aural

Specifies the average pitch of the speaking voice. Values have the following meanings:

<frequency>: Specifies the average pitch of the speaking voice in hertz (Hz).
x-low: Same as ?
low: Same as ?
medium: Same as ?
high: Same as ?
x-high: Same as ?

'pitch-range'

Property name:	'pitch-range'
Value:	<number> \| inherit
Initial:	50
Applies to:	all elements
Inherited:	yes
Percentage values:	N/A
Media groups:	aural

Specifies variation in average pitch. Values have the following meanings:

<number>: A pitch range of 0 produces a flat, monotonic voice. A pitch range of 50 produces normal inflection. Pitch ranges greater than 50 produce animated voices.

'stress'

Property name:	'stress'
Value:	<number> \| inherit
Initial:	50
Applies to:	all elements
Inherited:	yes
Percentage values:	N/A
Media groups:	aural

Specifies the level of stress (assertiveness or emphasis) of the speaking voice. English is a stressed language, and different parts of a sentence are assigned primary, secondary or tertiary stress. The value of 'stress' controls the amount of inflection that results from these stress markers. Values have the following meanings:

<number>: Increasing the value of this property results in the speech being more strongly inflected. It is, in a sense, a companion to the 'pitch-range' property and is provided to allow developers to exploit higher-end auditory displays.

'richness'

Property name:	'richness'
Value:	<number> \| inherit
Initial:	50
Applies to:	all elements
Inherited:	yes
Percentage values:	N/A
Media groups:	aural

Specifies the richness (brightness) of the speaking voice. Values have the following meanings:

<number>: The effect of increasing richness is to produce a voice that carries. Reducing richness produces a soft, mellifluous voice.
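
The numeric voice properties might be combined as in the following sketch. The class names and the specific values are arbitrary illustrations, not recommendations:

  P.drone  { pitch-range: 0 }            /* flat, monotonic voice */
  P.lively { pitch-range: 80 }           /* more animated than the normal 50 */
  P.deep   { pitch: 110Hz }              /* average pitch of 110 hertz */
  STRONG   { stress: 75; richness: 75 }  /* more strongly inflected, and carries further */
  P.aside  { pitch: low; richness: 25 }  /* softer than the initial 50 */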

19.9 Speech properties: 'speak-punctuation', 'speak-date', 'speak-numeral', and 'speak-time'

Note. The following four properties are preliminary and discussion on them is invited.

An additional speech property, 'speak-header', is described in the chapter on tables.

'speak-punctuation'

Property name:	'speak-punctuation'
Value:	code \| none \| inherit
Initial:	none
Applies to:	all elements
Inherited:	yes
Percentage values:	N/A
Media groups:	aural

This property specifies how punctuation is spoken. Values have the following meanings:

code: Punctuation such as semicolons, braces, and so on is to be spoken literally.
none: Punctuation is not to be spoken, but instead rendered naturally as various pauses.
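
For example, a style sheet might speak punctuation only in program listings. This is a sketch; the choice of elements is an assumption:

  PRE, CODE { speak-punctuation: code }  /* ";" and "{" are spoken literally */
  P         { speak-punctuation: none }  /* punctuation is rendered as pauses */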

'speak-date'

Property name:	'speak-date'
Value:	mdy \| dmy \| ymd \| inherit
Initial:	depends on user agent
Applies to:	all elements
Inherited:	yes
Percentage values:	N/A
Media groups:	aural

This property controls how dates are spoken. Values have the following meanings:

mdy: Month-Day-Year (common in the United States).
dmy: Day-Month-Year (common in Europe).
ymd: Year-Month-Day.

This property would be useful, for example, when combined with an XML element used to identify dates, such as:

   <PARA>The campaign started on <DATE value="1874-10-21"/>
    and finished <DATE value="1874-10-28"/></PARA>
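
A rule such as the following could then select the spoken order. This is only a sketch; how the user agent obtains the date from the 'value' attribute is not specified here:

  DATE { speak-date: dmy }  /* day, then month, then year */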

'speak-numeral'

Property name:	'speak-numeral'
Value:	digits \| continuous \| none \| inherit
Initial:	none
Applies to:	all elements
Inherited:	yes
Percentage values:	N/A
Media groups:	aural

This property controls how numerals are spoken. Values have the following meanings:

digits: Speak the numeral as individual digits. Thus, "237" is spoken "Two Three Seven".
continuous: Speak the numeral as a full number. Thus, "237" is spoken "Two hundred thirty seven". Word representations are language-dependent.
none: [What does this mean?]
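
As a sketch, with hypothetical class names, the two defined values might be applied as follows:

  SPAN.phone { speak-numeral: digits }      /* "237" is spoken "Two Three Seven" */
  SPAN.total { speak-numeral: continuous }  /* "237" is spoken "Two hundred thirty seven" */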

'speak-time'

Property name:	'speak-time'
Value:	24 \| 12 \| none \| inherit
Initial:	none
Applies to:	all elements
Inherited:	yes
Percentage values:	N/A
Media groups:	aural

This property controls how times are spoken. Values have the following meanings:

24: Use the 24-hour time system.
12: Use the 12-hour am/pm time system.
none: [What does this mean?]

When used in combination with the 'speak-date' property, this allows elements with an attribute containing an ISO 8601 date/time value to be presented in a flexible manner.
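
For instance, the DATE element from the 'speak-date' example might be styled as follows. This is a sketch only; it assumes that the element's attribute carries both a date and a time:

  DATE { speak-date: dmy; speak-time: 24 }  /* day-month-year order, 24-hour time */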