CSS3 Speech Module

W3C Working Draft 14 May 2003

This version:
http://www.w3.org/TR/2003/WD-css3-speech-20030514
Latest version:
http://www.w3.org/TR/css3-speech
Previous version:
no previous version
Editors:
Dave Raggett (W3C)
Daniel Glazman (Netscape/AOL)

Abstract

CSS (Cascading Style Sheets) is a language for describing the rendering of HTML and XML documents on screen, on paper, in speech, etc. CSS defines aural properties that give control over rendering XML to speech. This draft describes the text-to-speech properties proposed for CSS level 3. These are designed to match the model described in the Speech Synthesis Markup Language (SSML).

Status of this document

This document is a draft of one of the "modules" for the upcoming CSS3 specification.

This document is a working draft of the CSS working group which is part of the style activity (see summary). It has been developed in cooperation with the Voice Browser working group.

The CSS working group would like to receive feedback: comments on this draft may be sent to the editors, discussion takes place on the (archived) public mailing list www-style@w3.org (see instructions). W3C Members can also send comments directly to the CSS working group.

This working draft may be updated, replaced or rendered obsolete by other W3C documents at any time. It is inappropriate to use W3C Working Drafts as reference material or to cite them as other than "work in progress". Its publication does not imply endorsement by the W3C membership or the CSS Working Group (members only).

Patent disclosures relevant to CSS may be found on the Working Group's public patent disclosure page.

To find the latest version of this working draft, please follow the "Latest version" link above, or visit the list of W3C Technical Reports.

A list of current W3C Recommendations and other technical documents including Working Drafts and Notes can be found at http://www.w3.org/TR.

Table of contents

  1. Dependencies on other modules
  2. Introduction
  3. Volume properties: 'voice-volume' and 'voice-balance'
  4. Speaking properties: 'speak'
  5. Pause properties: 'pause-before', 'pause-after', and 'pause'
  6. Cue properties: 'cue-before', 'cue-after', and 'cue'
  7. Voice characteristic properties: 'voice-family', 'voice-rate', 'voice-pitch', 'voice-pitch-range', and 'voice-stress'
  8. Voice duration property: 'voice-duration'
  9. Phonetics: 'phonemes', '@phonetic-alphabet', and 'content'
  10. Interpretation property: 'interpret-as'

Dependencies on other modules

This CSS3 module depends on the following other CSS3 modules:

It has non-normative (informative) references to the following other CSS3 modules:

Introduction

The speech rendering of a document, already commonly used by the blind and print-impaired communities, combines speech synthesis and "auditory icons." Often such aural presentation occurs by converting the document to plain text and feeding this to a screen reader -- software or hardware that simply reads all the characters on the screen. This results in less effective presentation than would be the case if the document structure were retained. Style sheet properties for text to speech may be used together with visual properties (mixed media) or as an aural alternative to visual presentation.

Besides the obvious accessibility advantages, there are other large markets for listening to information, including in-car use, industrial and medical documentation systems (intranets), home entertainment, and to help users learning to read or who have difficulty reading.

When using aural properties, the canvas consists of a two-channel stereo space and a temporal space (one may specify sounds before, during, and after other sounds). The CSS properties also allow authors to vary the quality of synthesized speech (voice type, frequency, inflection, etc.).

h1, h2, h3, h4, h5, h6 {
  voice-family: paul;
  voice-stress: moderate;
  cue-before: url("ping.au")
}
P.heidi { voice-balance: left; voice-family: female }
P.peter { voice-balance: right; voice-family: male }
P.goat  { voice-volume: soft }

This will direct the speech synthesizer to speak headers in a voice (a kind of "audio font") called "paul". Before speaking the headers, a sound sample will be played from the given URL. Paragraphs with class "heidi" will appear to come from front left (if the sound system is capable of stereo), and paragraphs of class "peter" from the right. Paragraphs with class "goat" will be played softly.

Volume properties: 'voice-volume' and 'voice-balance'

'voice-volume'
Value: <number> | <percentage> | silent | x-soft | soft | medium | loud | x-loud | inherit
Initial: medium
Applies to: all elements
Inherited: yes
Percentages: refer to inherited value
Media: speech

The 'voice-volume' property refers to the median volume of the waveform; in other words, a highly inflected voice at a volume of '50' might peak well above that level. The overall volume levels are likely to be adjustable by the listener for comfort.

Values have the following meanings:

<number>
Any number between '0' and '100'. '0' represents silence (the minimum level), and 100 corresponds to the maximum level. This is intended to correspond to the conventional volume scale found on consumer audio equipment.
<percentage>
Percentage values are calculated relative to the inherited value, and are then clipped to the range '0' to '100'.
silent
Same as '0'.
x-soft
Same as '10'.
soft
Same as '25'.
medium
Same as '50'.
loud
Same as '75'.
x-loud
Same as '100'.

User agents should allow the level corresponding to '100' to be set by the listener. No one setting is universally applicable; suitable values depend on the equipment in use (speakers, headphones), and the environment (in car, home theater, library) and personal preferences.

Previous drafts defined '0' as the minimum audible level and '100' as the maximum comfortable level, distinguishing '0' (minimum perceptible level) from 'silent' (no perceptible sound). Unfortunately, this isn't practical with conventional speech synthesizers and audio mixers. The decision was thus taken to switch to a model matching conventional audio controls on consumer audio equipment including home computers.

The two values 'louder' and 'softer' have been removed from this specification to allow interoperability with the SSML specification. SSML does not have relative volume values, so producing SSML from XML+CSS would otherwise require a speech engine to compute the relative values into absolute ones.
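As an illustration, the following rules exercise the keyword, numeric, and percentage forms defined above (the class names are purely hypothetical):

```css
/* Illustrative only; class names are hypothetical. */
p          { voice-volume: medium }  /* same as '50' */
p.warning  { voice-volume: x-loud }  /* same as '100' */
p.aside    { voice-volume: 25% }     /* 25% of the inherited value, then clipped to 0-100 */
p.redacted { voice-volume: silent }  /* takes the same time as if spoken, but no sound */
```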

'voice-balance'
Value: <number> | left | center | right | leftwards | rightwards | inherit
Initial: center
Applies to: all elements
Inherited: yes
Percentages: N/A
Media: speech

The 'voice-balance' property refers to the balance between the left and right channels, and presumes a two-channel (stereo) model that is widely supported on consumer audio equipment.

Values have the following meanings:

<number>
Any number between '-100' and '100'. At '-100' only the left channel is audible. Similarly, at '100' or '+100' only the right channel is audible. At '0' both channels have the same level, so that the speech appears to come from the center.
left
Same as '-100'.
center
Same as '0'.
right
Same as '100' or '+100'.
leftwards
Moves the sound to the left, relative to the inherited voice balance. More precisely, subtracts 5 arbitrary units and clips the resulting value to the range '-100' to '100'.
rightwards
Moves the sound to the right, relative to the inherited voice balance. More precisely, adds 5 arbitrary units and clips the resulting value to the range '-100' to '100'.

Many speech synthesizers only support a single channel. The 'voice-balance' property can then be treated as part of a post-synthesis mixing step, where the speech is mixed with other audio sources.

Previous drafts defined an azimuth/elevation model. This has been replaced by a more conventional model that reflects the two-channel stereo mixing support found in most consumer audio equipment, e.g. the volume control in Microsoft Windows and the Gnome audio mixer on Linux.
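For instance, the stereo model could be used to separate speakers in a dialogue; the class names below are hypothetical:

```css
/* Illustrative dialogue styling; class names are hypothetical. */
p.romeo  { voice-balance: left }       /* same as '-100': left channel only */
p.juliet { voice-balance: 50 }         /* midway between center and full right */
p.chorus { voice-balance: center }     /* same as '0': equal level in both channels */
em       { voice-balance: leftwards }  /* inherited value minus 5, clipped to [-100, 100] */
```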

Speaking properties: 'speak'

An additional speech property, speak-header, is described in the chapter on tables.

'speak'
Value: none | normal | spell-out | digits | literal-punctuation | no-punctuation | inherit
Initial: normal
Applies to: all elements
Inherited: yes
Percentages: N/A
Media: speech

This property specifies whether text will be rendered aurally and if so, in what manner. The possible values are:

none
Suppresses aural rendering so that the element requires no time to render. Note, however, that descendants may override this value and will be spoken. (To be sure to suppress rendering of an element and its descendants, use the 'display' property).
normal
Uses language-dependent pronunciation rules for rendering an element and its children. Punctuation is not to be spoken, but instead rendered naturally as various pauses.
spell-out
Spells the text one letter at a time (useful for acronyms and abbreviations). In languages where accented characters are rare, it is permitted to drop accents in favor of alternative unaccented spellings. As an example, in English, the word "rôle" can also be written as "role". A conforming implementation would thus be able to spell-out "rôle" as "R O L E".
digits
Speak numbers one digit at a time; for instance, "12" would be spoken as "one two", and "31" as "three one".
literal-punctuation
Similar to the 'normal' value, but punctuation such as semicolons, braces, and so on is to be spoken literally.
no-punctuation
Similar to the 'normal' value, but punctuation is neither to be spoken nor rendered as various pauses.

Note the difference between an element whose 'voice-volume' property has a value of 'silent' and an element whose 'speak' property has the value 'none'. The former takes up the same time as if it had been spoken, including any pause before and after the element, but no sound is generated. The latter requires no time and is not rendered (though its descendants may be).

Speech synthesizers are knowledgeable about what is a number and what isn't. The 'speak' property gives authors the means to control how the synthesizer renders the numbers it discovers in the source text.
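As an illustration of the values above (the class names are hypothetical):

```css
/* Illustrative only; class names are hypothetical. */
acronym   { speak: spell-out }            /* "SSML" is spoken "S S M L" */
span.isbn { speak: digits }               /* "31" is spoken "three one" */
pre.code  { speak: literal-punctuation }  /* semicolons, braces, etc. are spoken */
div.skip  { speak: none }                 /* not rendered; takes no time */
```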

Editor's note: 'cardinal' and 'ordinal' values were dropped for the moment to avoid the difficulties involved in an adequate treatment of declension. The 'speak-punctuation' and 'speak' properties were merged to harmonize with SSML.

Editor's note: the ACSS speak-numeral property has been subsumed by the revised speak property. The value 'code' has been replaced by 'all' as this has broader use than just speaking program code.

Pause properties: 'pause-before', 'pause-after', and 'pause'

'pause-before'
Value: <time> | <percentage> | inherit
Initial: depends on user agent
Applies to: all elements
Inherited: no
Percentages: see prose
Media: speech

 

'pause-after'
Value: <time> | <percentage> | inherit
Initial: depends on user agent
Applies to: all elements
Inherited: no
Percentages: see prose
Media: speech

These properties specify a pause to be observed before (or after) speaking an element's content. Values have the following meanings:

<time>
Expresses the pause in absolute time units (seconds and milliseconds).
<percentage>
Refers to the inverse of the value of the 'voice-rate' property. For example, if the voice-rate is 120 words per minute (i.e., a word takes half a second, or 500ms), then a 'pause-before' of 100% means a pause of 500ms and a 'pause-before' of 20% means 100ms.

The pause is inserted between the element's content and any 'cue-before' or 'cue-after' content.

Authors should use relative units to create style sheets that are more robust in the face of large changes in voice-rate.
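Continuing the example above, at a voice-rate of 120 words per minute the absolute and relative forms can express the same pause:

```css
/* Assumes a voice-rate of 120 words per minute (500ms per word). */
h1 { pause-before: 500ms }  /* absolute: half a second, regardless of rate */
h2 { pause-before: 100% }   /* relative: one word's worth, i.e. 500ms at this rate */
h3 { pause-before: 20% }    /* relative: 100ms at this rate */
```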

'pause'
Value: [ <'pause-before'> || <'pause-after'> ] | inherit
Initial: depends on user agent
Applies to: all elements
Inherited: no
Percentages: see descriptions of 'pause-before' and 'pause-after'
Media: speech

Editor's note: the value of this shorthand property has been modified. The CSS2 spec contains an explicit definition, which is not the common case in the spec for shorthand properties. See 'cue' below, for instance, which remains unchanged.

The 'pause' property is a shorthand for setting 'pause-before' and 'pause-after'. If two values are given, the first value is 'pause-before' and the second is 'pause-after'. If only one value is given, it applies to both properties.

H1 { pause: 20ms } /* pause-before: 20ms; pause-after: 20ms */
H2 { pause: 30ms 40ms } /* pause-before: 30ms; pause-after: 40ms */
H3 { pause-after: 10ms } /* pause-before: ?; pause-after: 10ms */

Cue properties: 'cue-before', 'cue-after', and 'cue'

'cue-before'
Value: <uri> | none | inherit
Initial: none
Applies to: all elements
Inherited: no
Percentages: N/A
Media: speech

 

'cue-after'
Value: <uri> | none | inherit
Initial: none
Applies to: all elements
Inherited: no
Percentages: N/A
Media: speech

Auditory icons are another way to distinguish semantic elements. Sounds may be played before and/or after the element to delimit it. Values have the following meanings:

<uri>
The URI must designate an auditory icon resource. If the URI resolves to something other than an audio file, such as an image, the resource should be ignored and the property treated as if it had the value 'none'.
none
No auditory icon is specified.
A {cue-before: url("bell.aiff"); cue-after: url("dong.wav") }
H1 {cue-before: url("pop.au"); cue-after: url("pop.au") }
'cue'
Value: [ <'cue-before'> || <'cue-after'> ] | inherit
Initial: not defined for shorthand properties
Applies to: all elements
Inherited: no
Percentages: N/A
Media: speech

The 'cue' property is a shorthand for setting 'cue-before' and 'cue-after'. If two values are given, the first value is 'cue-before' and the second is 'cue-after'. If only one value is given, it applies to both properties.

The following two rules are equivalent:

H1 {cue-before: url("pop.au"); cue-after: url("pop.au") }
H1 {cue: url("pop.au") }

If a user agent cannot render an auditory icon (e.g., the user's environment does not permit it), we recommend that it produce an alternative cue (e.g., popping up a warning, emitting a warning sound, etc.)

Please see the sections on the :before and :after pseudo-elements for information on other content generation techniques.

Voice characteristic properties: 'voice-family', 'voice-rate', 'voice-pitch', 'voice-pitch-range', and 'voice-stress'

'voice-family'
Value: [[<specific-voice> | [<age>] <generic-voice>] [<number>],]* [<specific-voice> | [<age>] <generic-voice>] [<number>] | inherit
Initial: depends on user agent
Applies to: all elements
Inherited: yes
Percentages: N/A
Media: speech

The value is a comma-separated, prioritized list of voice family names (compare with 'font-family'). Values have the following meanings:

<specific-voice>
Values are specific instances (e.g., comedian, mary, carlos, "valley girl").
<age>
Possible values are 'child', 'young' and 'old'.
<generic-voice>
Values are voice families. Possible values are 'male' and 'female'.
<number>
Indicates a preferred variant of the other voice characteristics. (e.g. the second or next male voice). Possible values are positive integers.
h1 { voice-family: announcer, old male }
p.part.romeo { voice-family: romeo, young male }
p.part.juliet { voice-family: juliet, female }
p.part.mercutio { voice-family: male 2 }
p.part.tybalt { voice-family: male 3 }
p.part.nurse { voice-family: child female }

Names of specific voices may be quoted, and indeed must be quoted if any of the words that make up the name does not conform to the syntax rules for identifiers. It is also recommended to quote specific voices with a name consisting of more than one word. If quoting is omitted, any whitespace characters before and after the voice name are ignored and any sequence of whitespace characters inside the voice name is converted to a single space.
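For instance, reusing the voice names quoted earlier in this section:

```css
/* "valley girl" must be quoted: the name is not a single identifier. */
p.teen     { voice-family: "valley girl", female }
/* A single-identifier name such as "comedian" may be left unquoted. */
p.narrator { voice-family: comedian, old male }
```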

Speech platforms should do their best to provide good quality voices and may override the voice characteristics when a voice with the requested characteristics is unavailable, or would significantly reduce the perceived quality or intelligibility.

Conforming implementations should process documents according to the language. This is indicated by the xml:lang attribute as per the XML 1.0 specification, and is inherited by nested elements until overridden by a further xml:lang attribute. A document author should be aware that intra-sentential language changes may not be supported on all platforms.

The speech output platform largely determines behavior in the case that a document requires speech output in a language not supported by the speech output platform. In any case, if a value for xml:lang specifying an unsupported language is encountered, a conforming implementation should attempt to continue processing and should also notify the hosting environment in that case.

A language change often necessitates a change in the voice. Where the platform does not have the same voice in both the enclosing and enclosed languages it should select a new voice with the inherited voice characteristics. Any change in voice may affect implementation (and language) dependent characteristics.

The xml:lang attribute in the document markup provides a limited means to influence the choice of an accent through the choice of the country tag, for instance, "en-GB" or "en-US". This doesn't help with within-country variations such as Scottish or Welsh accents for English, nor does it help with foreign accents such as French spoken with an English accent. The current solution is to request a specific voice known to have the desired characteristics. Over time, it is possible that naming conventions may emerge to provide a cross-platform solution.

Editor's note: the treatment of language is taken from the SSML specification. My assumption is that "fr-en" is inappropriate as a means to specify French spoken with an English accent, but it would be nice to have this confirmed. Should the syntax for this property be relaxed to allow age as a modifier before a specific voice?

'voice-rate'
Value: <number> | x-slow | slow | medium | fast | x-fast | inherit
Initial: medium
Applies to: all elements
Inherited: yes
Percentages: refer to inherited value
Media: speech

This property specifies the speaking rate. Note that both absolute and relative keyword values are allowed (compare with 'font-size'). Values have the following meanings:

<number>
Specifies the speaking rate in words per minute, a quantity that varies somewhat by language but is nevertheless widely supported by speech synthesizers.
x-slow
Very slow. For instance 80 words per minute in English.
slow
Slow. For instance 120 words per minute in English.
medium
Normal speech rate for the language. For instance 180 - 200 words per minute in English.
fast
Fast. For instance 300 words per minute in English.
x-fast
Very fast. For instance 500 words per minute in English.
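For instance (the class names are hypothetical; the numeric value is in words per minute):

```css
/* Illustrative only; class names are hypothetical. */
p           { voice-rate: medium }  /* e.g. 180-200 words per minute in English */
p.legal     { voice-rate: fast }    /* e.g. 300 words per minute in English */
p.dictation { voice-rate: 120 }     /* explicit rate in words per minute */
```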

Editor's note: the list of values above has been modified. This was in the CSS2 issues list: 180-200 words per minute ('medium') is fine for English, but other languages use very different values! In French, for instance, 180-200 is really very fast, like a radio announcer commentating on a horse race.

Editor's note: the values 'faster' and 'slower' were removed to be consistent with SSML.

'voice-pitch'
Value: <number> | x-low | low | medium | high | x-high | inherit
Initial: medium
Applies to: all elements
Inherited: yes
Percentages: refer to inherited value
Media: speech

Specifies the average pitch (a frequency) of the speaking voice. The average pitch of a voice depends on the voice family. For example, the average pitch for a standard male voice is around 120Hz, but for a female voice, it's around 210Hz.

Values have the following meanings:

<number>
Specifies the average pitch of the speaking voice in Hertz.
x-low, low, medium, high, x-high
These values do not map to absolute frequencies since these values depend on the voice family. User agents should map these values to appropriate frequencies based on the voice family and user environment. However, user agents must map these values in order (i.e., 'x-low' is a lower frequency than 'low', etc.).

SSML allows for relative values in semitones. This would necessitate a new CSS unit "st". How valuable is this? What about the alternative of providing 'higher' and 'lower' for consistency with other related voice properties?
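As an illustration (the class names are hypothetical):

```css
/* Illustrative only; class names are hypothetical. */
p.alto    { voice-pitch: low }
p.soprano { voice-pitch: high }
p.tuned   { voice-pitch: 210 }  /* average pitch in Hertz; typical for a female voice */
```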

'voice-pitch-range'
Value: <number> | low | medium | high | inherit
Initial: 50
Applies to: all elements
Inherited: yes
Percentages: refer to inherited value
Media: speech

Specifies variation in average pitch. The perceived pitch of a human voice is determined by the fundamental frequency and typically has a value of 120Hz for a male voice and 210Hz for a female voice. Human languages are spoken with varying inflection and pitch; these variations convey additional meaning and emphasis. Thus, a highly animated voice, i.e., one that is heavily inflected, displays a high pitch range. This property specifies the range over which these variations occur, i.e., how much the fundamental frequency may deviate from the average pitch.

Values have the following meanings:

<number>
The pitch range in Hertz. Low ranges produce a flat, monotonic voice. A high range produces animated voices.
low
A flat monotonic voice.
medium
A normal voice.
high
A highly animated voice.

The pitch ranges for low, medium and high are language specific.
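As an illustration (the class names are hypothetical):

```css
/* Illustrative only; class names are hypothetical. */
p.monotone { voice-pitch-range: low }   /* flat, monotonic delivery */
p.excited  { voice-pitch-range: high }  /* heavily inflected, animated delivery */
p.tuned    { voice-pitch-range: 30 }    /* deviation from the average pitch, in Hertz */
```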

Is it worth providing additional named values, such as 'x-low', 'x-high', 'higher' and 'lower', for consistency with other voice related properties?

'voice-stress'
Value: strong | moderate | none | reduced | inherit
Initial: moderate
Applies to: all elements
Inherited: yes
Percentages: N/A
Media: speech

Indicates the strength of emphasis to be applied. The amount depends on the language being spoken.

Values have the following meanings:

strong
Apply a strong emphasis.
moderate
Apply a moderate emphasis.
none
Inhibit the synthesizer from emphasizing words it would normally emphasize.
reduced
Effectively the opposite of emphasizing a word. For example, when the phrase "going to" is reduced it may be spoken as "gonna".
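As an illustration (the HTML emphasis elements are natural hosts for this property; the class name is hypothetical):

```css
em          { voice-stress: moderate }
strong      { voice-stress: strong }
span.casual { voice-stress: reduced }  /* "going to" may be spoken "gonna" */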

Voice duration property: 'voice-duration'

'voice-duration'
Value: <time>
Initial: implicit
Applies to: all elements
Inherited: no
Percentages: N/A
Media: speech

This property allows authors to specify how long it should take to render the selected element's content. It overrides the 'voice-rate' property. Values have the following meanings:

<time>

Specifies a value in seconds or milliseconds for the desired time to take to speak the element contents, for instance, "250ms", or "3s".
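For instance (the class names are hypothetical):

```css
/* Illustrative only; class names are hypothetical. */
p.announcement { voice-duration: 3s }     /* render the contents in 3 seconds */
span.countdown { voice-duration: 250ms }  /* overrides any inherited voice-rate */
```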

Phonetics: 'phonemes', '@phonetic-alphabet', and 'content'

'phonemes'
Value: <string>
Initial: implicit
Applies to: all elements
Inherited: no
Percentages: N/A
Media: speech

This allows authors to specify a phonetic pronunciation for the text contained by the corresponding element. The default alphabet for the pronunciation string is the International Phonetic Alphabet ("ipa"). The phonetic alphabet can be explicitly specified using the @phonetic-alphabet rule, for instance:

@phonetic-alphabet "ipa";

#tomato { phonemes: "tɒmɑtoʊ" }

This will direct the speech synthesizer to replace the default pronunciation by the corresponding sequence of phonemes in the designated alphabet.

Sometimes, authors will want to specify a mapping from the source text into another string prior to the application of the regular pronunciation rules. This may be used for uncommon acronyms which are unlikely to be recognized by the synthesizer. The 'content' property can be used to replace one string by another. In the following example, the acronym element is rendered using the content of the title attribute instead of the element's content:

  acronym { content: attr(title) }
...

<acronym title="world wide web consortium">W3C</acronym>

This replaces the content of the selected element by the string "world wide web consortium".

Editor's note: the alphabet is specified via an at-rule to avoid problems with inappropriate cascades that could occur if the alphabet were set via a property. It might be helpful to provide a phonetic example using an ASCII-based phonetic alphabet, due to the difficulty some people have viewing IPA characters.

Interpretation property: 'interpret-as'

'interpret-as'
Value: <date> | <time> | currency | measure | telephone | address | name | net
Initial: implicit
Applies to: all elements
Inherited: no
Percentages: N/A
Media: speech

This provides a hint to the speech platform as to how to interpret the corresponding element's content and is useful when the content is ambiguous and liable to be misinterpreted. Values have the following meanings:

<date>

Specifies a date using the same syntax as SSML. For instance, "date(dmy)" when applied to "12/11/2002" would be interpreted as 12th November 2002.

<time>

Specifies a time using the same syntax as SSML. For instance, "time(hm)" when applied to "06:30" would be interpreted as 6 hours and 30 minutes.

currency
A hint that the element's content is a currency value.
measure
A hint that the element's content is a measurement.
telephone
A hint that the element's content is a telephone number.
address
A hint that the element's content is an address.
name
A hint that the element's content is a proper name of a person, company etc.
net
A hint that the element's content is a URI such as an email address or an http URI.
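As an illustration, using the functional notation described for <date> and <time> above (the class names are hypothetical):

```css
/* Illustrative only; class names are hypothetical. */
span.when  { interpret-as: date(dmy) }  /* "12/11/2002" read as 12th November 2002 */
span.start { interpret-as: time(hm) }   /* "06:30" read as 6 hours and 30 minutes */
span.phone { interpret-as: telephone }
```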

The above definitions were taken from the SSML specification dated 23 August 2002, but there is continuing uncertainty as to how the SSML "say-as" mechanism will end up. If SSML doesn't provide the same set of input types, then it may be necessary to transform the content before it can be directly mapped into SSML. We may just choose to drop 'interpret-as' until a future revision of this specification when the details of the say-as element have been clarified.


Appendix A : Changes from previous versions

This section needed revising. To be done.

Appendix B : Profiles

We have at least 4 profiles: level 1, level 2, level 3 and full. To be done.

Appendix C : Acknowledgements

To be done.

Appendix D : References

To be done.