Specification for Spoken Presentation in HTML

Abstract

Accurate pronunciation by text-to-speech (TTS) synthesis matters in many contexts and is critical in education, publishing, communication, and entertainment, among other domains. TTS has become an important technology for providing access to digital content on the web. Yet there is currently no way to mark up content so that TTS output is presented correctly across commonly used TTS engines and operating environments.

We identify two markup approaches in this publication to give content authors reliable pronunciation of HTML content regardless of the operating environment (or assistive technology) users might choose to use. Each approach has been demonstrated to yield consistent results. We seek feedback from authors and implementors to help determine which approach should be advanced to normative recommendation status by W3C.

We base each candidate approach on a subset of Speech Synthesis Markup Language (SSML). Our selected subset is carefully chosen to bring consistency and predictability to spoken presentation across a full range of assistive technologies and operating environments. Both technical approaches described in this publication carefully avoid the impasse that has prevented SSML from becoming a native HTML technology and should, therefore, be generally applicable. Either approach described here satisfies our requirements for assistive technologies and will be useful to voice assistants which consume and present HTML content in spoken form. We seek feedback on which approach would prove most implementable across all applications of spoken presentation of web content.

1. Introduction

This section is non-normative.

In this First Public Working Draft (FPWD) publication we define two independent approaches for achieving accurate, consistent, and reliable pronunciation by Text-to-Speech (TTS) engines across all operating environments, regardless of any assistive technology also in use. We are publishing both approaches now in order to obtain feedback from the wider community on which of the two is preferable, and why.

Text-to-speech is necessary for people with disabilities and useful for all. Accurate pronunciation is essential in many situations such as in education and educational assessment (testing students). Many computers and mobile devices today have built-in TTS functionality that is also commonly used by people without disabilities in different situations, such as when driving or interacting with personal data assistants.

The W3C's Web Content Accessibility Guidelines (WCAG) emphasize the importance of correct pronunciation.

For example, in the English language heteronyms are words that are spelled the same but have different pronunciations and meanings, such as the words desert (abandon) and desert (arid region). … Additionally, in some languages certain characters can be pronounced in different ways. In Japanese, for example, there are characters like Han characters (Kanji) that have multiple pronunciations. … When read incorrectly, the content will not make sense to users. (Understanding Success Criterion 3.1.6: Pronunciation)

While WCAG provides numerous workarounds for indicating correct pronunciation, it is nevertheless forced to categorize Success Criterion 3.1.6 as AAA because it cannot point to a reliable technological solution.

The W3C has two mature pronunciation related specifications:

Speech Synthesis Markup Language (SSML)
Pronunciation Lexicon Specification (PLS)

These specifications have long provided technical methods that allow authors to embed pronunciation (and related spoken presentation) markup in their HTML documents, but SSML's approach has not been adopted for several technical reasons. Additionally, feedback from various browser and assistive technology vendors has indicated that this is not a likely or viable approach. In this specification, therefore, we do not attempt to reinvent the wheel, but rather to bridge the long-standing technical barriers that have prevented use of SSML in HTML. Our approaches rely directly on both specifications, wrapping them as attributes which will be accepted by HTML validation.

As noted we have identified two candidate approaches:

multi-attribute — uses one or more element attributes with string values to convey each SSML function and property.
single-attribute — uses a single element attribute with a JavaScript Object Notation (JSON) string to convey all SSML functions and properties.
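
To make the difference between the two candidate approaches concrete, the following Python sketch serializes one SSML function and its properties in both forms. The helper names `to_multi_attribute` and `to_single_attribute` are hypothetical and are not part of this specification. The sketch also assumes the general `data-ssml-<function>-<property>` naming pattern, which does not cover the bare `data-ssml-say-as` attribute.

```python
import json


def to_multi_attribute(function: str, props: dict[str, str]) -> dict[str, str]:
    """Multi-attribute form: one HTML attribute per SSML property."""
    return {f"data-ssml-{function}-{name}": value for name, value in props.items()}


def to_single_attribute(function: str, props: dict[str, str]) -> dict[str, str]:
    """Single-attribute form: one data-ssml attribute holding a JSON object string."""
    payload = json.dumps({function: props}, separators=(",", ":"), ensure_ascii=False)
    return {"data-ssml": payload}


if __name__ == "__main__":
    prosody = {"rate": "slow", "pitch": "low"}
    print(to_multi_attribute("prosody", prosody))
    print(to_single_attribute("prosody", prosody))
```

Both calls carry the same information. The first yields two attributes (`data-ssml-prosody-rate`, `data-ssml-prosody-pitch`), while the second yields a single `data-ssml` attribute whose value must be parsed as JSON by the consumer.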

The task force encourages implementors and authors to provide feedback about these approaches. Once analyzed, the feedback will help determine which approach will become the final normative W3C recommendation.

The following sections include example code for each approach. Please refer to the sample content examples, complete with audio files, for comparison with existing TTS technology.

Editor's note

Using the data- prefix to name attributes is not the editors' recommendation or preference. Rather, it is the canonical approach for developing enhancements to HTML as defined in the HTML 5.x specification. This standards-based development approach enables experimental implementations which, in turn, will inform the further development of this specification.

For a more in-depth introduction to pronunciation issues and related W3C documents, please refer to Pronunciation Overview.

2. Multi-attribute Approach for Including SSML in HTML

By converting SSML tags and attributes to HTML attributes, authors can embed pronunciation (and related spoken presentation) in their HTML document. Authors can combine most supported tags with each other to apply multiple speech effects.

Edgar Allan Poe's The Raven:

EXAMPLE 1

<p data-ssml-prosody-rate="slow" data-ssml-prosody-pitch="low">
    Once upon a midnight 
    <span data-ssml-phoneme-alphabet="ipa" data-ssml-phoneme-ph="ˈdrɪəri">dreary</span>
    <span data-ssml-break-time="500ms"></span>,
    while I pondered, weak
    <span data-ssml-break-time="150ms"></span> and weary,<br data-ssml-break-time="500ms" />
    Over many a quaint and curious volume of forgotten
    <span data-ssml-prosody-rate="x-slow" data-ssml-prosody-pitch="low"> lore—</span><br />
    While I nodded, nearly napping, suddenly there came a tapping,
    <br data-ssml-audio-src="/soundlibrary/wood/hits/hits_11" />
    As of some one gently rapping,
    <span data-ssml-audio-src="/soundlibrary/wood/hits/hits_11"></span>
    rapping at my chamber door.
    <span data-ssml-audio-src="/soundlibrary/wood/hits/hits_11"></span>
    <br data-ssml-audio-src="/soundlibrary/wood/hits/hits_11" />
    <span data-ssml-prosody-volume="x-soft" data-ssml-prosody-rate="medium">
      "'Tis some visitor,"
    </span>
    I muttered, <span data-ssml-prosody-volume="x-soft" data-ssml-prosody-rate="x-slow">
    <span data-ssml-phoneme-alphabet="ipa" data-ssml-phoneme-ph="tæpɪŋ">"tapping</span>
    at my chamber door—</span><br data-ssml-break-time="750ms" />
    Only this <span data-ssml-break-strength="weak"></span> and nothing
    <span data-ssml-break-strength="none"></span>
    <span data-ssml-prosody-volume="soft" data-ssml-prosody-rate="75%"> more."</span>
</p>
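
As an illustration of how a consumer might read multi-attribute markup, the following Python sketch walks an HTML fragment with the standard library `html.parser` module and groups every `data-ssml-*` attribute by SSML function. The names `group_ssml_attributes` and `SsmlAttributeCollector` are hypothetical, and the grouping rule (match the attribute name against the known function names) is an assumption of this sketch rather than a processing model defined here.

```python
from html.parser import HTMLParser

PREFIX = "data-ssml-"
# SSML functions covered by the multi-attribute set in this document.
FUNCTIONS = ("say-as", "phoneme", "sub", "voice", "emphasis", "break", "prosody", "audio")


def group_ssml_attributes(attrs):
    """Group (name, value) attribute pairs as {function: {property: value}}."""
    grouped = {}
    for name, value in attrs:
        if not name.startswith(PREFIX):
            continue
        rest = name[len(PREFIX):]
        if rest == "say-as":
            # The bare data-ssml-say-as attribute stands in for interpret-as.
            grouped.setdefault("say-as", {})["interpret-as"] = value
            continue
        for function in FUNCTIONS:
            if rest.startswith(function + "-"):
                grouped.setdefault(function, {})[rest[len(function) + 1:]] = value
                break
    return grouped


class SsmlAttributeCollector(HTMLParser):
    """Collect (tag, grouped attributes) for every element that carries data-ssml-*."""

    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        grouped = group_ssml_attributes(attrs)
        if grouped:
            self.found.append((tag, grouped))


if __name__ == "__main__":
    collector = SsmlAttributeCollector()
    collector.feed(
        '<p data-ssml-prosody-rate="slow" data-ssml-prosody-pitch="low">'
        'Once upon a midnight <span data-ssml-break-time="500ms"></span></p>'
    )
    collector.close()
    print(collector.found)
```

Note that `html.parser` lowercases attribute names, so a mixed-case name such as `data-ssml-audio-clipBegin` would be reported as `clipbegin` by this sketch.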

2.1 The `data-ssml-*` Multi-Attribute Set

These attributes provide functional equivalence to the SSML counterparts. These attributes are valid on the following HTML elements:

2.1.1 `data-ssml-say-as(-*)`

Allows the author to classify the element's text content. The attributes are derived from the SSML say-as element and associated properties.

Editor's note

interpret-as seems superfluous, and should be implied

`data-ssml-say-as`

`data-ssml-say-as-format` (optional)

Value: time/date format as defined in the W3C Note "SSML 1.0 say-as attribute values".

`data-ssml-say-as-detail` (optional)

Value: detail as defined in the W3C Note "SSML 1.0 say-as attribute values".

Editor's note

The data-ssml-say-as-detail attribute allows authors to target implementation-specific TTS engine features or behavior.

EXAMPLE 2

According to the 2010 US Census, the population of <span data-ssml-say-as="characters">90274</span>
increased to 25209 from 24976 over the past 10 years.

2.1.2 `data-ssml-phoneme-*`

Defines two required attributes for phonemic/phonetic pronunciation. The element with the phoneme attributes can only contain text (no elements). The attributes are derived from the SSML phoneme element and associated properties.

`data-ssml-phoneme-ph`

Value: The phoneme string.

`data-ssml-phoneme-alphabet`

Values: The phonetic alphabet in use. ipa | x-sampa

EXAMPLE 3

Once upon a midnight <span data-ssml-phoneme-alphabet="ipa" data-ssml-phoneme-ph="ˈdrɪəri">dreary</span>
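
Because both phoneme attributes are required and the alphabet is limited to `ipa` or `x-sampa`, an authoring tool could check the pair before publishing. The following Python sketch shows one way to do that; the function name `check_phoneme` and the wording of its messages are hypothetical.

```python
ALLOWED_ALPHABETS = {"ipa", "x-sampa"}


def check_phoneme(attrs: dict[str, str]) -> list[str]:
    """Return a list of problems; an empty list means the attribute pair looks usable."""
    problems = []
    if not attrs.get("data-ssml-phoneme-ph"):
        problems.append("missing data-ssml-phoneme-ph")
    alphabet = attrs.get("data-ssml-phoneme-alphabet")
    if alphabet is None:
        problems.append("missing data-ssml-phoneme-alphabet")
    elif alphabet not in ALLOWED_ALPHABETS:
        problems.append(f"unsupported alphabet: {alphabet}")
    return problems


if __name__ == "__main__":
    print(check_phoneme({
        "data-ssml-phoneme-alphabet": "ipa",
        "data-ssml-phoneme-ph": "ˈdrɪəri",
    }))
    # A misnamed attribute is reported as a missing alphabet.
    print(check_phoneme({"data-ssml-alphabet": "ipa", "data-ssml-phoneme-ph": "ˈdrɪəri"}))
```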

2.1.3 `data-ssml-sub-alias`

A string value that replaces the text content for pronunciation. While similar to aria-label, alias does not alter the spelling presented by other output (e.g., on a Braille display). Additionally, the alias attribute can be used by TTS technologies that do not access the accessibility tree. The processor should apply text normalization to the alias value. The attribute is derived from the SSML sub element and associated properties.

Value: text string to be substituted and delivered to the TTS for presentation.

EXAMPLE 4

<span data-ssml-sub-alias="Sodium Chloride">NaCl</span>
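
The substitution can be sketched in Python as follows: the text handed to the TTS engine uses the alias, while the element's visible text content is left untouched in the document. The class `SpokenTextBuilder` and the helper `spoken_text` are hypothetical names, and text normalization of the alias value is deliberately left out of this sketch.

```python
from html.parser import HTMLParser

# Void elements never receive an end tag, so they are not tracked on the stack.
VOID_ELEMENTS = {"area", "base", "br", "col", "embed", "hr", "img",
                 "input", "link", "meta", "source", "track", "wbr"}


class SpokenTextBuilder(HTMLParser):
    """Build the string to be spoken, replacing aliased content with its alias."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._aliased = []  # one flag per open element: True if it carries an alias

    def handle_starttag(self, tag, attrs):
        if tag in VOID_ELEMENTS:
            return
        alias = dict(attrs).get("data-ssml-sub-alias")
        if alias is not None and not any(self._aliased):
            self.parts.append(alias)
        self._aliased.append(alias is not None)

    def handle_endtag(self, tag):
        if tag not in VOID_ELEMENTS and self._aliased:
            self._aliased.pop()

    def handle_data(self, data):
        # Text inside an aliased element is not spoken; the alias was spoken instead.
        if not any(self._aliased):
            self.parts.append(data)


def spoken_text(html: str) -> str:
    builder = SpokenTextBuilder()
    builder.feed(html)
    builder.close()
    return "".join(builder.parts)


if __name__ == "__main__":
    print(spoken_text('Add <span data-ssml-sub-alias="Sodium Chloride">NaCl</span> to water.'))
```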

2.1.4 `data-ssml-voice-*`

A set of attributes defining production values that request a change in speaking voice. There are two kinds of attributes for the voice element: those that indicate desired features of a voice and those that control behavior. The attributes are derived from the SSML voice element and associated properties.

`data-ssml-voice-gender` (optional)

Values: female | male | neutral

`data-ssml-voice-age` (optional)

Value: integer corresponding to age in years

`data-ssml-voice-variant` (optional)

Value: integer indicating a numeric voice variant

`data-ssml-voice-name` (optional)

Values: specific voice name requested from the current TTS engine (e.g., "David").

`data-ssml-voice-languages` (optional)

Value: string containing a space-delimited list of one or more languages to be spoken by this voice.

Editor's note

The data-ssml-voice-languages attribute only assists the TTS engine with selecting the appropriate voice. It does not indicate the language of content. To specify language, use the HTML lang attribute.

EXAMPLE 5

She said, "<span data-ssml-voice-gender="female">My name is Marie</span>".

2.1.5 `data-ssml-emphasis-level`

Requests that the text content be spoken with emphasis (also referred to as prominence or stress). This is a single attribute and is derived from the SSML emphasis element and associated properties.

Values: strong | moderate | none | reduced

EXAMPLE 6

Please use <span data-ssml-emphasis-level="strong">extreme caution.</span>

2.1.6 `data-ssml-break-*`

Describes the timing associated with an empty element to control the pausing or other prosodic boundaries between tokens. The use of the break attribute between any pair of tokens is optional. If the element is not present between tokens, the synthesis processor is expected to automatically determine a break based on the linguistic context. The attributes are derived from the SSML break element and associated properties.

`data-ssml-break-strength`

`data-ssml-break-time`

Value: string containing a time duration expressed in numeric form as "250ms", "1s", etc.

EXAMPLE 7

Take a deep breath,<span data-ssml-break-time="1s"></span> and exhale.
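
A consumer has to turn duration strings such as "250ms" and "1s" into a number before it can schedule a pause. The following Python sketch shows one possible conversion to milliseconds. The function name `duration_to_ms` is hypothetical, and the accepted syntax (a non-negative number followed by `s` or `ms`) is an assumption based on the examples given in this document.

```python
import re

# A non-negative decimal number followed by a unit: seconds (s) or milliseconds (ms).
_DURATION = re.compile(r"^(\d+(?:\.\d+)?|\.\d+)(ms|s)$")


def duration_to_ms(value: str) -> float:
    """Convert a duration string such as "250ms" or "1s" to milliseconds."""
    match = _DURATION.match(value.strip())
    if match is None:
        raise ValueError(f"not a time duration: {value!r}")
    number, unit = match.groups()
    return float(number) * (1000.0 if unit == "s" else 1.0)


if __name__ == "__main__":
    for sample in ("250ms", "1s", "1.5s"):
        print(sample, "->", duration_to_ms(sample))
```

The same conversion would apply to the other duration-valued attributes, such as `data-ssml-prosody-duration` and `data-ssml-audio-fetchtimeout`.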

2.1.7 `data-ssml-prosody-*`

Permits control of the pitch, speaking rate and volume of the speech output. The attributes are derived from the SSML prosody element and associated properties.

`data-ssml-prosody-pitch` (optional)

`data-ssml-prosody-contour` (optional)

Value: string of contour change parameters as defined in the SSML 1.1 recommendation.

`data-ssml-prosody-range` (optional)

Value: string range value as defined in the SSML 1.1 recommendation.

`data-ssml-prosody-rate` (optional)

`data-ssml-prosody-duration` (optional)

Value: string containing a time duration (e.g., "250ms", "1s", etc.).

`data-ssml-prosody-volume` (optional)

EXAMPLE 8

The tortoise said (slowly), "<span data-ssml-prosody-rate="x-slow">
I am almost at the finish line</span>."

2.1.8 `data-ssml-audio-*`

Supports the insertion of recorded audio files in conjunction with synthesized speech output. The element may be empty. If the element is not empty, then the contents should be spoken if the audio document is not available. The attributes are derived from the SSML audio element and associated properties.

`data-ssml-audio-src`

Value: The URI of a document with an appropriate media file.

`data-ssml-audio-fetchtimeout` (optional)

Value: string containing a time duration (e.g., "250ms", "1s", etc.).

`data-ssml-audio-fetchint` (optional)

Values: safe | prefetch

`data-ssml-audio-maxage` (optional)

Value: string

`data-ssml-audio-maxstale` (optional)

Value: string

`data-ssml-audio-clipBegin` (optional)

Value: string containing a time duration (e.g., "250ms", "1s", etc.).

`data-ssml-audio-clipEnd` (optional)

Value: string containing a time duration (e.g., "250ms", "1s", etc.).

`data-ssml-audio-repeatCount` (optional)

Value: integer indicating the number of times to repeat the audio clip.

`data-ssml-audio-repeatDur` (optional)

Value: string containing a time duration (e.g., "250ms", "1s", etc.).

EXAMPLE 9

You will hear a brief chime <span data-ssml-audio-src="/audio/chime.ogg"></span> 
when your time is up.
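
The fallback rule described in this section (speak the element's contents when the audio document is not available) can be sketched as a small decision function. In the Python sketch below, `resolve_audio` is a hypothetical name and `is_available` is a caller-supplied check. How availability is actually determined (for example, a fetch bounded by `data-ssml-audio-fetchtimeout`) is outside the scope of the sketch.

```python
from typing import Callable, Optional


def resolve_audio(
    src: str,
    fallback_text: str,
    is_available: Callable[[str], bool],
) -> tuple[str, Optional[str]]:
    """Decide what to present for an element carrying data-ssml-audio-src."""
    if is_available(src):
        return ("audio", src)            # play the recorded audio
    if fallback_text.strip():
        return ("speak", fallback_text)  # audio unavailable: speak the contents
    return ("skip", None)                # empty element and no audio: nothing to present


if __name__ == "__main__":
    print(resolve_audio("/audio/chime.ogg", "", lambda src: True))
    print(resolve_audio("/audio/chime.ogg", "chime", lambda src: False))
```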

3. Single-attribute Approach for Including SSML in HTML

By converting SSML tags and attributes into a single HTML attribute with a JSON string value, authors can embed pronunciation (and related spoken presentation) in their HTML document. Authors can combine most supported tags with each other to apply multiple speech effects.

Most of the markup included in SSML is suitable for use by the majority of content developers; however, some features, such as phoneme and prosody, may require specialized knowledge. This approach emerged as a means to transform content conforming to the IMS Question & Test Interoperability (QTI) Specification. The QTI standard supports inclusion of SSML in HTML for TTS tools used in educational assessment.

Edgar Allan Poe's The Raven:

EXAMPLE 1

<p data-ssml='{"prosody":{"rate":"slow","pitch":"low"}}'>
    Once upon a midnight
    <span data-ssml='{"phoneme":{"alphabet":"ipa","ph":"ˈdrɪəri"}}'>dreary</span>
    <span data-ssml='{"break":{"time":"500ms"}}'></span>,
    while I pondered, weak
    <span data-ssml='{"break":{"time":"150ms"}}'></span> and weary,
    <br data-ssml='{"break":{"time":"500ms"}}' />
    Over many a quaint and curious volume of forgotten
    <span data-ssml='{"prosody":{"rate":"x-slow","pitch":"low"}}'>lore—</span><br />
    While I nodded, nearly napping, suddenly there came a tapping,
    <br data-ssml='{"audio":{"src":"/soundlibrary/wood/hits/hits_11"}}' />
    As of some one gently rapping,
    <span data-ssml='{"audio":{"src":"/soundlibrary/wood/hits/hits_11"}}'></span>
    rapping at my chamber door.
    <span data-ssml='{"audio":{"src":"/soundlibrary/wood/hits/hits_11"}}'></span>
    <br data-ssml='{"audio":{"src":"/soundlibrary/wood/hits/hits_11"}}' />
    <span data-ssml='{"prosody":{"volume":"x-soft","rate":"medium"}}'>
      "'Tis some visitor,"
    </span>
    I muttered, <span data-ssml='{"prosody":{"volume":"x-soft","rate":"x-slow"}}'>
    <span data-ssml='{"phoneme":{"alphabet":"ipa","ph":"tæpɪŋ"}}'>"tapping</span>
    at my chamber door—</span><br data-ssml='{"break":{"time":"750ms"}}' />
    Only this<span data-ssml='{"break":{"strength":"weak"}}'></span>
    and nothing<span data-ssml='{"break":{"strength":"none"}}'> </span>
    <span data-ssml='{"prosody":{"volume":"soft","rate":"75%"}}'>more."</span>
</p>

3.1 `data-ssml` Attribute, Properties and Values

The data-ssml attribute and the properties defined below provide functional equivalence to their SSML counterparts. The attribute is valid on the following HTML elements:

The value of the data-ssml attribute is a JSON string, enclosed with single quotes ('), containing a single JSON object representing a specific SSML function with one or more property/value pairs. The valid objects, properties and associated values are defined in the following sections. The JSON schema is presented in Appendix A.
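
Since the attribute value must be well-formed JSON holding a single object, a consumer would typically parse and check it before use. The following Python sketch illustrates such a check. The function name `parse_data_ssml`, the error messages, and the decision to reject unknown function names are assumptions of this sketch; the JSON schema remains the authoritative definition.

```python
import json

# SSML functions defined for the data-ssml attribute in this document.
SSML_FUNCTIONS = {"say-as", "phoneme", "sub", "voice", "emphasis", "break", "prosody", "audio"}


def parse_data_ssml(value: str) -> tuple[str, dict]:
    """Parse a data-ssml value and return (function, properties)."""
    try:
        parsed = json.loads(value)
    except json.JSONDecodeError as exc:
        raise ValueError(f"data-ssml is not valid JSON: {exc.msg}") from exc
    if not isinstance(parsed, dict) or len(parsed) != 1:
        raise ValueError("data-ssml must contain exactly one SSML function object")
    function, props = next(iter(parsed.items()))
    if function not in SSML_FUNCTIONS:
        raise ValueError(f"unknown SSML function: {function}")
    if not isinstance(props, dict) or not props:
        raise ValueError(f"{function} needs one or more property/value pairs")
    return function, props


if __name__ == "__main__":
    print(parse_data_ssml('{"break":{"time":"500ms"}}'))
    try:
        # A missing closing brace is a common authoring slip with hand-written JSON.
        parse_data_ssml('{"break":{"time":"500ms"}')
    except ValueError as error:
        print("rejected:", error)
```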

3.1.1 `say-as`

Allows the author to classify the element's text content. The JSON definition is derived from the SSML say-as element and associated properties.

`interpret-as`

`format` (optional)

Value: time/date format as defined in W3C Note SSML say-as attribute values.

`detail` (optional)

Value: detail as defined in W3C Note SSML say-as attribute values.

Editor's note

The detail property allows authors to target implementation-specific TTS engine features or behavior.

EXAMPLE 2

According to the 2010 US Census, the population of
<span data-ssml='{"say-as":{"interpret-as":"characters"}}'>90274</span>
increased to 25209 from 24976 over the past 10 years.
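
To illustrate the functional equivalence with SSML, the following Python sketch turns a `data-ssml` value and the element's text content into the corresponding SSML element using the standard library `xml.etree.ElementTree` module. The function name `to_ssml` is hypothetical, and the sketch handles a single, non-nested element only.

```python
import json
import xml.etree.ElementTree as ET


def to_ssml(data_ssml: str, text: str = "") -> str:
    """Render one data-ssml value plus text content as an SSML element string."""
    function, props = next(iter(json.loads(data_ssml).items()))
    element = ET.Element(function, {name: str(value) for name, value in props.items()})
    element.text = text or None  # leave empty elements (e.g. break) without text
    return ET.tostring(element, encoding="unicode")


if __name__ == "__main__":
    print(to_ssml('{"say-as":{"interpret-as":"characters"}}', "90274"))
    print(to_ssml('{"break":{"time":"1s"}}'))
```

For the say-as value shown above, this yields `<say-as interpret-as="characters">90274</say-as>`, which is the markup an SSML-capable TTS engine would expect.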

3.1.2 `phoneme`

Defines two required properties for phonemic/phonetic pronunciation. The element carrying the phoneme object can only contain text (no elements). The JSON definition is derived from the SSML phoneme element and associated properties.

`ph`

Value: string containing the phonetic characters corresponding to the content to be spoken

`alphabet`

Value: ipa | x-sampa defining the phonetic alphabet used for the ph string

EXAMPLE 3

Once upon a midnight 
<span data-ssml='{"phoneme":{"alphabet":"ipa","ph":"ˈdrɪəri"}}'>dreary</span>

3.1.3 `sub`

Indicates that the text in the alias attribute value replaces the text content for pronunciation. The required alias property specifies the string to be spoken instead of the text content. The processor should apply text normalization to the alias value. The JSON definition is derived from the SSML sub element and associated properties.

`alias`

Value: string containing the text to be spoken as a substitution for the text content of the element to which sub is applied.

EXAMPLE 4

<span data-ssml='{"sub":{"alias":"Sodium Chloride"}}'>NaCl</span>

3.1.4 `voice`

Requests a change in speaking voice. There are two kinds of attributes for voice: those that indicate desired features of a voice and those that control behavior. The JSON definition is derived from the SSML voice element and associated properties.

`gender` (optional)

Values: female | male | neutral

`age` (optional)

Value: integer corresponding to age in years

`variant` (optional)

Value: integer indicating a numeric voice variant

`name` (optional)

Value: string defining a specific voice name requested from the current TTS engine, e.g., "Microsoft David (English)"

`languages` (optional)

Value: string containing a space-delimited list of one or more languages to be spoken by this voice.

Editor's note

The voice > languages property only assists the TTS engine with selecting the appropriate voice. It does not indicate the language of content. To specify language, use the HTML lang attribute.

EXAMPLE 5

She said, "<span data-ssml='{"voice":{"gender":"female"}}'>My name is Marie</span>".

3.1.5 `emphasis`

Requests that the text content of the element to which emphasis is applied be spoken with emphasis (also referred to as prominence or stress). The JSON definition is derived from the SSML emphasis element and associated properties.

`level`

Value: strong | moderate | none | reduced

EXAMPLE 6

Please use <span data-ssml='{"emphasis":{"level":"strong"}}'>extreme caution.</span>

3.1.6 `break`

Describes the timing associated with an empty element to control the pausing or other prosodic boundaries between tokens. The use of the break between any pair of tokens is optional. If the element is not present between tokens, the synthesis processor is expected to automatically determine a break based on the linguistic context. The JSON definition is derived from the SSML break element and associated properties.

`strength`

`time`

Value: string containing a time duration expressed in numeric form as "250ms", "1s", etc. (s=second, ms=milliseconds)

EXAMPLE 7

Take a deep breath,<span data-ssml='{"break":{"time":"1s"}}'></span> and exhale.

3.1.7 `prosody`

Permits control of the pitch, speaking rate and volume of the speech output. The object has six properties. The JSON definition is derived from the SSML prosody element and associated properties.

`pitch`

`contour`

Value: string of contour change parameters as defined in the SSML 1.1 recommendation

`range`

Value: string range value as defined in the SSML 1.1 recommendation

`rate`

`duration`

Value: string containing a time duration expressed in numeric form as "250ms", "1s", etc.

`volume`

EXAMPLE 8

The tortoise said (slowly), "
<span data-ssml='{"prosody":{"rate":"x-slow"}}'>I am almost at the finish line</span>."

3.1.8 `audio`

Supports the insertion of recorded audio files in conjunction with synthesized speech output. The element may be empty. If the element is not empty, then the contents should be the text to be spoken if the audio document is not available. The JSON definition is derived from the SSML audio element and associated properties.

`src`

Value: The URI of a document with an appropriate media file.

`fetchtimeout`

Value: string containing a time duration expressed in numeric form as "250ms", "1s", etc.

`fetchint`

Value: safe | prefetch

`maxage`

Value: string

`maxstale`

Value: string

`clipBegin`

Value: string containing a time duration expressed in numeric form as "250ms", "1s", etc.

`clipEnd`

Value: string containing a time duration expressed in numeric form as "250ms", "1s", etc.

`repeatCount`

Value: integer indicating the number of times to repeat the audio clip.

`repeatDur`

Value: string containing a time duration expressed in numeric form as "250ms", "1s", etc.

EXAMPLE 9

You will hear a brief chime 
<span data-ssml='{"audio":{"src":"/audio/chime.ogg"}}'></span> when your time is up.

W3C Working Draft 23 September 2021