The Protocols and Formats Working Group is no longer chartered to operate. Its work will continue in two new Working Groups:

  • Accessible Platform Architectures, to review specifications, develop technical support materials, collaborate with other Working Groups on technology accessibility, and coordinate harmonized accessibility strategies within W3C; and
  • Accessible Rich Internet Applications, to continue development of the Accessible Rich Internet Applications (WAI-ARIA) suite of technologies and other technical specifications when needed to bridge known gaps.

Resources from the PFWG remain available to support long-term institutional memory, but this information is of historical value only.

This Wiki page was edited by participants of the Protocols and Formats Working Group. It does not necessarily represent consensus and it may have incorrect information or information that is not supported by other Working Group participants, WAI, or W3C. It may also have some very useful information.

CSS/Spec Review/Speech

From Protocols and Formats Working Group Wiki

Andi Snow-Weaver

5.1 There seems to be a need for values relative to the current setting, such as louder and softer.

5.1 It is possible to create damaging content. Imagine a web page that is very soft and then, in the middle, shouts at the maximum decibel level. Deliberate suffering could be created this way, not unlike deliberately triggering a photosensitive epileptic seizure. Should this be prevented?
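
A minimal sketch of the scenario described above, using keyword values from the draft's voice-volume property. The class names are hypothetical and chosen only for illustration.

```css
/* Hypothetical class names; the keywords are the draft's voice-volume values. */
.whisper {
  voice-volume: x-soft;   /* quiet enough that a listener may raise the device volume */
}

.shout {
  voice-volume: x-loud;   /* followed, without warning, by the loudest keyword level */
}
```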

10.1 Should there be a way to specify an accent or locale setting?

10.5 Consider the more descriptive name voice-emphasis rather than voice-stress. Stress sounds only angry; emphasis carries less emotion.

James Craig

Comments are mainly with regard to screen reader usage, as we believe it to be the primary use case for CSS 3 Speech. If some of these properties or values are specific to other use cases, the document should mention that scope. For example, we understand that this could be used to "save this page as an audio file," but we do not believe that to be a common usage, and we believe our potential concerns for screen reader usage outweigh any benefit these properties may provide for the less common scenarios.



The voice-volume property is of concern because it appears to allow page authors to hijack the user interface. Screen reader users tend to set their speech volume at an audible but comfortable level, and allowing an author to set the volume to x-loud or a high decibel value could be a very disruptive experience. Furthermore, some screen reader users also have hearing impairments, so allowing an author to set the volume to x-soft or a low decibel value could result in content being inaccessible to those users. We would like to suggest the group reconsider or further explain the necessity for this property, or at least consider removing the x-* values and decibel support. The spec's note ("listening environment and personal user preferences") at the end of this section appears to confirm our concern that this property is perhaps immature, and it would be unwise to implement it without additional consideration of other vague, unspecified details such as user preference overrides and the ability for user agents to be more aware of their usage environment or context.

voice-volume: silent;

The spec should give an example of expected appropriate usage of this value. Because this generates a period of silence equal to the length of the would-be-spoken content, most listeners will just assume speech output has prematurely stopped. In radio terms, this is "dead air." How do you expect this value to be useful?


Despite the at-risk status of this property, we believe it would be extremely useful for conveying context, particularly in situations such as two-party dialogue.


WebKit and VoiceOver in the iOS5 betas implement partial support for the original values of the 'speak' property in CSS 2.1 as well as some additional values defined by the previous working draft of CSS 3, which seemed a logical progression from CSS 2.1. Since the Working Group had not published an updated draft in over five years, we would not have expected this property to change so drastically. Please reconsider this property split, since 1) it is not apparent why the split was made, and 2) there is an existing implementation that is unlikely to change in the pending release.

Previous values, from the most recent draft published in December 2004.

speak:/speak-as: values.

Whether or not the 'speak' and 'speak-as' properties are recombined, the values for the 'speak-as' property are listed as single token values, but are not mutually exclusive. We would expect to be able to use a token list to specify multiple values that apply. Perhaps:

.telephone {
     speak-as: digits no-punctuation; /* e.g., (415) 555-1212 */
}
.internetProtocol {
     speak-as: digits literal-punctuation; /* e.g., */
}


These properties are of concern because they represent another way for the page author to hijack a screen reader user's experience. We are also concerned that end users will interpret correct implementation of these properties as a severe performance lag. For example, if a user were forced to wait 2 seconds between each heading, the experience would be tedious for TTS users comfortable with machine speech at rates pushing 400 words per minute.

If you plan to keep these properties, we suggest the following:

  1. Consider defining a few variants of the @media values defining the particular speech context. A long pause may provide slightly more value for the "save to audio file" or "read all" context than it would to a general screen reader user in the process of navigating a document quickly. We think it's unlikely that many screen reader users would want this feature affecting their TTS speed and responsiveness.
  2. Define a maximum range for pause-before <time>, preferably less than 2s for screen readers, and issue validation warnings for times over the maximum.
  3. Define millisecond values or WPM-relative time values for tokens, preferably all less than 1s. The document states that this is implementation-dependent. W3C history has shown this will result in drastically different values, and inconsistent implementation will be frustrating for authors and users alike.
  4. In a separate document (perhaps HTML5), define default mappings of elements to their expected pause values, for example a table with pause-before and pause-after as columns and each HTML element as a row.
  5. Unequivocally declare that implementors should ignore pause-before values when navigating to an element in the screen reader context, so as not to create the perception of performance lag. For example, if a screen reader user presses the command to "jump to next heading," speak it immediately. Ignore pause-before immediately after a focus change.
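
Suggestions 1 to 3 could be sketched roughly as follows. This is only an illustration: the speech media type is the existing one from CSS 2.1, the selectors are arbitrary, and the 250ms and 500ms figures are placeholders, not values proposed by the draft or by these comments.

```css
/* Illustrative only: explicit, short pauses scoped to speech output.
   The durations are placeholder values kept under the 1s suggested above. */
@media speech {
  h1, h2, h3 {
    pause-before: 500ms;  /* explicit <time> rather than an implementation-dependent token */
  }

  p {
    pause-after: 250ms;
  }
}
```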


Consider token-based named sound icons, such as "warning", "error", or "progress-complete." Leave this flexible for platform- and implementation-specific values, such as "-osx-tink" or "-ios-tweet" and provide a comma-delimited fallback in the same way a user can specify a generic family fallback in addition to a named font:

font-family: "MyFont", sans-serif;
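
Applied to cues, the proposal might look like the sketch below. This is hypothetical syntax and is not defined by the draft, whose cue properties take a URI. The icon names are the ones floated above and the class names are made up for the example.

```css
/* Hypothetical syntax, not valid under the current draft:
   a platform-specific sound icon first, then a generic named fallback. */
.alert {
  cue-before: -osx-tink, warning;
}

.done {
  cue-after: -ios-tweet, progress-complete;
}
```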

cue-before:/cue-after: <decibel> properties.

We have the same concern with this decibel value as mentioned above with voice-volume.

voice-family: preserve;

Quoting from the editor's draft:

Indicates that the ‘voice-family’ value gets inherited and used regardless of any potential language change within the content markup

This property value appears short-sighted, as most TTS voices are not only intended for a particular language, but are also mostly incapable of producing speech when confronted with characters outside their intended range of Unicode characters. For example, it is highly unlikely that a Chinese TTS voice will be able to pronounce English in an understandable way (for anything other than very common words such as "okay"), and it's even less likely that a French TTS voice would be able to speak any words in Japanese. It seems this property value is only beneficial to force Western language TTS voices to mispronounce other Western languages, which is a feature of very little utility.
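
For illustration, a sketch of the kind of style sheet in question, assuming a French document that contains a Japanese phrase. The selectors and the generic voice value are chosen only for the example.

```css
/* Assumed context: French body text containing an embedded <span lang="ja">. */
:lang(fr) {
  voice-family: female;     /* generic voice, presumably resolved to a French voice */
}

:lang(ja) {
  voice-family: preserve;   /* keeps the inherited voice despite the language change */
}
```

Under the quoted definition, the Japanese phrase would be handed to whatever voice was selected for the surrounding French text.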

voice-duration: <time>;

This is another property that seems to provide very little worth. For example, what would be the expected behavior given the following CSS:

p { voice-duration: 1s; }

and the following markup:

<p>Short paragraph.</p>

<p>Longer paragraph. Lorem ipsum dolor sit amet, consectetur adipiscing elit.
Integer elementum interdum ullamcorper. Nunc et ante dui. Sed odio erat, dictum
vitae adipiscing nec, aliquam sed nibh. Fusce pharetra ante dolor. </p>

The first paragraph would be understandable, but should the second paragraph really be pronounced over a duration of 1 second? Probably not. Implementation of this would be tricky, too. Have any other vendors expressed an interest in its implementation?

voice-duration is marked as at-risk, and we support dropping it from the final specification.


This property seems to provide limited utility while raising the same hijacking concerns and implementation difficulty. voice-stress is marked as at-risk, and we support dropping it from the final specification.

Peter Thiessen

Is the intent to cover "styling" dynamic content as well? I really tried to find conflicting ARIA states and properties. I haven't found any yet.