Pronunciation Use Cases

Abstract

The objective of the Pronunciation Task Force is to develop normative specifications and best practices guidance, collaborating with other W3C groups as appropriate, to provide for proper pronunciation in HTML content when using text to speech (TTS) synthesis. This document provides various use cases highlighting the need for standardization of pronunciation markup, to ensure consistent and accurate representation of the content. The requirements from the user scenarios provide the basis for these technical requirements/specifications.

This specification is obsolete. Please see the latest Pronunciation Gap Analysis and Use Cases for the Pronunciation Use Cases.

Status of This Document

This section describes the status of this document at the time of its publication. Other documents may supersede this document. A list of current W3C publications and the latest revision of this technical report can be found in the W3C technical reports index at https://www.w3.org/TR/.

This document was published by the Accessible Platform Architectures Working Group as a Working Group Note.

Comments regarding this document are welcome. Please send them to public-pronunciation@w3.org (archives).

Publication as a Working Group Note does not imply endorsement by the W3C Membership. This is a draft document and may be updated, replaced or obsoleted by other documents at any time. It is inappropriate to cite this document as other than work in progress.

This document was produced by a group operating under the W3C Patent Policy.

This document is governed by the 1 March 2019 W3C Process Document.

2. Use Case aria-ssml

2.1 Background and Current Practice

A new ARIA attribute, aria-ssml, could be used to include pronunciation content.

2.2 Goal

Embed SSML in an HTML document.

2.3 Target Audience

Assistive Technology
Browser Extensions
Search Engines

2.4 Implementation Options

aria-ssml as embedded JSON

When AT encounters an element with aria-ssml, the AT should enhance the UI by processing the pronunciation content and passing it to the Web Speech API or an external API (e.g., Google's Text to Speech API).

Example 1

I say <span aria-ssml='{"phoneme":{"ph":"pɪˈkɑːn","alphabet":"ipa"}}'>pecan</span>.
You say <span aria-ssml='{"phoneme":{"ph":"ˈpi.kæn","alphabet":"ipa"}}'>pecan</span>.

Client will convert JSON to SSML and pass the XML string to a speech API.

Example 2

var msg = new SpeechSynthesisUtterance();
msg.text = convertJSONtoSSML(element.getAttribute('aria-ssml'));
speechSynthesis.speak(msg);
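
Example 2 calls convertJSONtoSSML without defining it. The following is a minimal sketch of such a helper, not a normative definition. It assumes the JSON has the shape {"tag": {"attribute": "value"}} shown in Example 1, and it takes the element's text and a language tag as extra parameters, which is an assumption of this sketch because Example 2 passes only the attribute value.

```javascript
// Hypothetical helper: the document names convertJSONtoSSML but does not define it.
// Assumes the JSON has the shape {"<ssml tag>": {"<attribute>": "<value>", ...}}.

// Escape characters that are significant in XML text and attribute values.
function escapeXML(value) {
  return String(value)
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;');
}

// text and lang are assumptions of this sketch, not part of Example 2.
function convertJSONtoSSML(json, text, lang = 'en-US') {
  const data = JSON.parse(json);
  // Only the first key is used; one attribute value describes one SSML element.
  const [tag] = Object.keys(data);
  const attributes = Object.entries(data[tag])
    .map(([name, value]) => ` ${name}="${escapeXML(value)}"`)
    .join('');
  return '<speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis"' +
    ` xml:lang="${escapeXML(lang)}">` +
    `<${tag}${attributes}>${escapeXML(text)}</${tag}></speak>`;
}
```

A client could call it as convertJSONtoSSML(element.getAttribute('aria-ssml'), element.textContent).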

aria-ssml referencing XML by template ID

Example 3

<!-- ssml must appear inside a template to be valid -->
<template id="pecan">
<?xml version="1.0"?>
<speak version="1.1"
       xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                   http://www.w3.org/TR/speech-synthesis11/synthesis.xsd"
       xml:lang="en-US">
    You say, <phoneme alphabet="ipa" ph="pɪˈkɑːn">pecan</phoneme>.
    I say, <phoneme alphabet="ipa" ph="ˈpi.kæn">pecan</phoneme>.
</speak>
</template>

<p aria-ssml="#pecan">You say, pecan. I say, pecan.</p>

Client will parse XML and serialize it before passing to a speech API:

Example 4

var msg = new SpeechSynthesisUtterance();
var xml = document.getElementById('pecan').content.firstElementChild;
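// serialize() is not defined in this document; it stands for any routine
// that turns the parsed <speak> element back into an XML string.
msg.text = serialize(xml);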
speechSynthesis.speak(msg);

aria-ssml referencing an XML string as script tag

Example 5

<script id="pecan" type="application/ssml+xml">
<speak version="1.1"
       xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                   http://www.w3.org/TR/speech-synthesis11/synthesis.xsd"
       xml:lang="en-US">
    You say, <phoneme alphabet="ipa" ph="pɪˈkɑːn">pecan</phoneme>.
    I say, <phoneme alphabet="ipa" ph="ˈpi.kæn">pecan</phoneme>.
</speak>
</script>

<p aria-ssml="#pecan">You say, pecan. I say, pecan.</p>

Client will pass the XML string raw to a speech API.

Example 6

var msg = new SpeechSynthesisUtterance();
msg.text = document.getElementById('pecan').textContent;
speechSynthesis.speak(msg);

aria-ssml referencing an external XML document by URL

Example 7

<p aria-ssml="http://example.com/pronounce.ssml#pecan">You say, pecan. I say, pecan.</p>

Client will pass the string payload to a speech API.

Example 8

var msg = new SpeechSynthesisUtterance();
// Read the URL from the aria-ssml attribute (el is the referencing element).
var response = await fetch(el.getAttribute('aria-ssml'));
msg.text = await response.text();
speechSynthesis.speak(msg);
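
The four options above place three kinds of value in the same attribute: embedded JSON, a same-document reference such as #pecan, and a URL with an optional fragment. A client would need to tell them apart before choosing a code path. The sketch below shows one way to do so. The function name classifySSMLReference and the shape of the returned object are assumptions made for illustration.

```javascript
// Hypothetical helper: classify an aria-ssml (or data-ssml) attribute value.
function classifySSMLReference(value, baseURL) {
  const trimmed = value.trim();
  if (trimmed.startsWith('{')) {
    // Embedded JSON, as in Example 1.
    return { kind: 'json', data: JSON.parse(trimmed) };
  }
  if (trimmed.startsWith('#')) {
    // Same-document reference to a template or script element (Examples 3 and 5).
    return { kind: 'id', id: trimmed.slice(1) };
  }
  // Anything else is treated as a URL, resolved against the page (Example 7).
  const url = new URL(trimmed, baseURL);
  const fragment = url.hash.slice(1);
  url.hash = '';
  return { kind: 'url', url: url.href, fragment };
}
```

In a browser, document.baseURI would be a natural value for baseURL.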

2.5 Existing Work

2.6 Problems and Limitations

aria-ssml is not a valid aria-* attribute.
Some OS/browser combinations do not support the serialized XML usage of the Web Speech API.
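
Where a speech engine does not accept serialized SSML, a client could fall back to plain text. The sketch below reduces an SSML string to its text content and, for sub elements, to the alias that is meant to be spoken. The function name ssmlToPlainText is hypothetical, and the approach is regex-based rather than a full XML parse, so it is only suitable for simple markup like the examples in this document.

```javascript
// Hypothetical fallback: reduce an SSML string to plain text for engines
// that do not accept SSML. Regex-based, so it is not a full XML parser.
function ssmlToPlainText(ssml) {
  return ssml
    // Drop the XML declaration, if present.
    .replace(/<\?xml[^>]*\?>/g, '')
    // For <sub>, keep the alias (the text to be spoken), not the content.
    .replace(/<sub\b[^>]*\balias="([^"]*)"[^>]*>[\s\S]*?<\/sub>/g, '$1')
    // Drop all remaining tags but keep their text content.
    .replace(/<[^>]+>/g, '')
    // Decode the basic XML entities; &amp; goes last to avoid double decoding.
    .replace(/&lt;/g, '<')
    .replace(/&gt;/g, '>')
    .replace(/&quot;/g, '"')
    .replace(/&amp;/g, '&')
    // Collapse the whitespace left behind by the removed markup.
    .replace(/\s+/g, ' ')
    .trim();
}
```

Phonetic detail is lost in this fallback, so the engine's default pronunciation is used.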

3. Use Case data-ssml

3.1 Background and Current Practice

As an existing attribute, data-* could be used, with some conventions, to include pronunciation content.

3.2 Goal

Support repeated use within the page context
Support external file references
Reuse existing techniques without expanding specifications

3.3 Target Audience

Hearing users

3.4 Implementation Options

data-ssml as embedded JSON

When an element with data-ssml is encountered by an SSML-aware AT, the AT should enhance the user interface by processing the referenced SSML content and passing it to the Web Speech API or an external API (e.g., Google's Text to Speech API).

Example 9

<h2>The Pronunciation of Pecan</h2>
<p><speak>
I say <span data-ssml='{"phoneme":{"ph":"pɪˈkɑːn","alphabet":"ipa"}}'>pecan</span>.
You say <span data-ssml='{"phoneme":{"ph":"ˈpi.kæn","alphabet":"ipa"}}'>pecan</span>.
</speak></p>

Client will convert JSON to SSML and pass the XML string to a speech API.

Example 10

var msg = new SpeechSynthesisUtterance();
msg.text = convertJSONtoSSML(element.dataset.ssml);
speechSynthesis.speak(msg);

data-ssml referencing XML by template ID

Example 11

<!-- ssml must appear inside a template to be valid -->
<template id="pecan">
<?xml version="1.0"?>
<speak version="1.1"
       xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                   http://www.w3.org/TR/speech-synthesis11/synthesis.xsd"
       xml:lang="en-US">
    You say, <phoneme alphabet="ipa" ph="pɪˈkɑːn">pecan</phoneme>.
    I say, <phoneme alphabet="ipa" ph="ˈpi.kæn">pecan</phoneme>.
</speak>
</template>

<p data-ssml="#pecan">You say, pecan. I say, pecan.</p>

Client will parse XML and serialize it before passing to a speech API:

Example 12

var msg = new SpeechSynthesisUtterance();
var xml = document.getElementById('pecan').content.firstElementChild;
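// serialize() is not defined in this document; it stands for any routine
// that turns the parsed <speak> element back into an XML string.
msg.text = serialize(xml);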
speechSynthesis.speak(msg);

data-ssml referencing an XML string as script tag

Example 13

<script id="pecan" type="application/ssml+xml">
<speak version="1.1"
       xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                   http://www.w3.org/TR/speech-synthesis11/synthesis.xsd"
       xml:lang="en-US">
    You say, <phoneme alphabet="ipa" ph="pɪˈkɑːn">pecan</phoneme>.
    I say, <phoneme alphabet="ipa" ph="ˈpi.kæn">pecan</phoneme>.
</speak>
</script>

<p data-ssml="#pecan">You say, pecan. I say, pecan.</p>

Client will pass the XML string raw to a speech API.

Example 14

var msg = new SpeechSynthesisUtterance();
msg.text = document.getElementById('pecan').textContent;
speechSynthesis.speak(msg);

data-ssml referencing an external XML document by URL

Example 15

<p data-ssml="http://example.com/pronounce.ssml#pecan">You say, pecan. I say, pecan.</p>

Client will pass the string payload to a speech API.

Example 16

var msg = new SpeechSynthesisUtterance();
// Read the URL from the data-ssml attribute (el is the referencing element).
var response = await fetch(el.dataset.ssml);
msg.text = await response.text();
speechSynthesis.speak(msg);
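
One goal of this use case is repeated use within the page context. When several elements reference the same external file, a client might load that file once and share the result. The sketch below memoizes a loader for that purpose. The name createSSMLCache and the injected load function are assumptions made for illustration.

```javascript
// Hypothetical helper: memoize a loader so each external SSML document is
// requested once, however many elements on the page reference it.
function createSSMLCache(load) {
  const cache = new Map();
  return function get(reference) {
    // Ignore the fragment: "#pecan" and "#tomato" live in the same document.
    const [documentURL] = reference.split('#');
    if (!cache.has(documentURL)) {
      // Store whatever the loader returns (typically a promise).
      cache.set(documentURL, load(documentURL));
    }
    return cache.get(documentURL);
  };
}
```

In a browser the loader could be url => fetch(url).then(response => response.text()), and each element would then await get(el.dataset.ssml).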

3.5 Existing Work

3.6 Problems and Limitations

Does not assume or suggest visual pronunciation help for deaf or hard of hearing users
Use of data-* requires input from AT vendors
XML data is not indexed by search engines

4. Use Case HTML5

4.1 Background and Current Practice

HTML5 includes the XML namespaces for MathML and SVG, so using elements from either in an HTML5 document is valid. Because SSML's implementation is non-visual in nature, browser implementation could be slow or non-existent without affecting how authors use SSML in HTML. Expanding HTML5 to include the SSML namespace would allow valid use of SSML in an HTML5 document. Until then, browsers would treat each SSML element like any other unknown element, as HTMLUnknownElement.

4.2 Goal

Support valid use of SSML in HTML5 documents
Allow visual pronunciation support

4.3 Target Audience

SSML-aware technologies and browser extensions
Search indexers

4.4 Implementation Options

SSML

Example 17

<h2>The Pronunciation of Pecan</h2>
<p><speak>
  You say, <phoneme alphabet="ipa" ph="pɪˈkɑːn">pecan</phoneme>.
  I say, <phoneme alphabet="ipa" ph="ˈpi.kæn">pecan</phoneme>.
</speak></p>

4.5 Existing Work

4.6 Problems and Limitations

SSML is not valid HTML5

5. Use Case Custom Element

5.1 Background and Current Practice

Embed valid SSML in HTML using custom elements registered as ssml-* where * is the actual SSML tag name (except for p which expects the same treatment as an HTML p in HTML layout).

5.2 Goal

Support use of SSML in HTML documents.

5.3 Target Audience

SSML-aware technologies and browser extensions
Search indexers

5.4 Implementation Options

ssml-speak: see demo

Only the <ssml-speak> component requires registration. The component code lifts the SSML by reading its innerHTML, removing the ssml- prefix from the interior tags, and passing the result to the Web Speech API. The <p> tag from SSML is not given the prefix because we still want to start a semantic paragraph within the content. The other tags used in the example have no semantic meaning in HTML. Tags like <em> in HTML could be converted to <emphasis> in SSML. In that case, CSS styles will come from the browser's default styles or the page author.

Example 18

<ssml-speak>
  Here are <ssml-say-as interpret-as="characters">SSML</ssml-say-as> samples.
  I can pause<ssml-break time="3s"></ssml-break>.
  I can speak in cardinals.
  Your number is <ssml-say-as interpret-as="cardinal">10</ssml-say-as>.
  Or I can speak in ordinals.
  You are <ssml-say-as interpret-as="ordinal">10</ssml-say-as> in line.
  Or I can even speak in digits.
  The digits for ten are <ssml-say-as interpret-as="characters">10</ssml-say-as>.
  I can also substitute phrases, like the <ssml-sub alias="World Wide Web Consortium">W3C</ssml-sub>.
  Finally, I can speak a paragraph with two sentences.
  <p>
    <ssml-s>You say, <ssml-phoneme alphabet="ipa" ph="pɪˈkɑːn">pecan</ssml-phoneme>.</ssml-s>
    <ssml-s>I say, <ssml-phoneme alphabet="ipa" ph="ˈpi.kæn">pecan</ssml-phoneme>.</ssml-s>
  </p>
</ssml-speak>
<template id="ssml-controls">
  <style>
    [role="switch"][aria-checked="true"] :first-child,
    [role="switch"][aria-checked="false"] :last-child {
      background: #000;
      color: #fff;
    }
  </style>
  <slot></slot>
  <p>
    <span id="play">Speak</span>
    <button role="switch" aria-checked="false" aria-labelledby="play">
      <span>on</span>
      <span>off</span>
    </button>
  </p>
</template>

Example 19

class SSMLSpeak extends HTMLElement {
  constructor() {
    super();
    const template = document.getElementById('ssml-controls');
    const templateContent = template.content;
    this.attachShadow({mode: 'open'})
      .appendChild(templateContent.cloneNode(true));
  }
  connectedCallback() {
    const button = this.shadowRoot.querySelector('[role="switch"][aria-labelledby="play"]')
    const ssml = this.innerHTML.replace(/ssml-/gm, '')
    const msg = new SpeechSynthesisUtterance();
    msg.lang = document.documentElement.lang;
    msg.text = `<speak version="1.1"
      xmlns="http://www.w3.org/2001/10/synthesis"
      xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
      xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
        http://www.w3.org/TR/speech-synthesis11/synthesis.xsd"
      xml:lang="${msg.lang}">
    ${ssml}
    </speak>`;
    msg.voice = speechSynthesis.getVoices().find(voice => voice.lang.startsWith(msg.lang));
    msg.onstart = () => button.setAttribute('aria-checked', 'true');
    msg.onend = () => button.setAttribute('aria-checked', 'false');
    button.addEventListener('click', () => speechSynthesis[speechSynthesis.speaking ? 'cancel' : 'speak'](msg))
  }
}

customElements.define('ssml-speak', SSMLSpeak);
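
Example 19 removes every occurrence of "ssml-" from innerHTML, which also affects any text content that happens to contain that string. The sketch below restricts the rewrite to tag names and adds the <em> to <emphasis> mapping mentioned above. The function name liftSSML is hypothetical, and the regular expressions assume well-formed markup as produced by innerHTML.

```javascript
// Hypothetical helper: convert the inner markup of <ssml-speak> to SSML.
function liftSSML(innerHTML) {
  return innerHTML
    // Strip the ssml- prefix from opening and closing tag names only.
    .replace(/<(\/?)ssml-([a-z][a-z-]*)/gi, '<$1$2')
    // Map HTML <em> to SSML <emphasis>; the lookahead leaves <embed> alone.
    .replace(/<(\/?)em(?=[\s>\/])/gi, '<$1emphasis');
}
```

In Example 19 this could replace the expression this.innerHTML.replace(/ssml-/gm, '').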

5.5 Existing Work

5.6 Problems and Limitations

Some OS/browser combinations do not support the serialized XML usage of the Web Speech API.
Browsers may need to map SSML tags with CSS styles for default user agent styles.
Without an extension or AT, only user interaction can start the Web Speech API.
Authors or parsing may need to remove HTML content with unintended SSML semantics before serialization.
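
The last limitation arises because some HTML element names, such as sub, mark and audio, also exist in SSML with different meanings. One possible mitigation, sketched below, is to unwrap every tag that is neither ssml-* nor <p> before the prefix is removed, keeping the text content. The function name unwrapPlainHTML is hypothetical, and tags an author wants mapped to SSML (such as <em>) would need to be exempted.

```javascript
// Hypothetical helper: unwrap HTML tags that are neither ssml-* nor <p>,
// so that HTML elements such as <sub> or <mark> are not read as SSML.
function unwrapPlainHTML(innerHTML) {
  return innerHTML.replace(/<\/?([a-zA-Z][\w-]*)[^>]*>/g, (tag, name) => {
    const lower = name.toLowerCase();
    // Keep the tag itself only for <p> and prefixed SSML elements.
    return lower === 'p' || lower.startsWith('ssml-') ? tag : '';
  });
}
```

The function removes tags only; the text inside them is still spoken.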

6. Use Case JSON-LD

6.1 Background and Current Practice

JSON-LD provides an established standard for embedding data in HTML. Unlike other microdata approaches, JSON-LD helps to reuse standardized annotations through external references.

6.2 Goal

Support use of SSML in HTML documents.

6.3 Target Audience

SSML-aware technologies and browser extensions
Search indexers

6.4 Implementation Options

JSON-LD

Example 20

<script type="application/ld+json">
{
  "@context": "http://schema.org/",
  "@id": "/pronunciation#WKRP",
  "@type": "RadioStation",
  "name": ["WKRP", {
    "@type": "PronounceableText",
    "textValue": "WKRP",
    "speechToTextMarkup": "SSML",
    "phoneticText": "<speak><say-as interpret-as=\"characters\">WKRP</say-as></speak>"
  }]
}
</script>
<p>
  Do you listen to <span itemscope
    itemtype="http://schema.org/PronounceableText"
    itemid="/pronunciation#WKRP">WKRP</span>?
</p>
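
A client could look up the SSML for a referenced item as sketched below. The sketch assumes that the name property holds a plain string followed by an object of type PronounceableText, and that the JSON-LD text has already been read from the script element. The function name findPhoneticText is hypothetical.

```javascript
// Hypothetical helper: return the SSML string for a JSON-LD node, or null.
function findPhoneticText(jsonld, id) {
  const node = JSON.parse(jsonld);
  // Match the node against the itemid used in the page content.
  if (node['@id'] !== id) return null;
  const names = Array.isArray(node.name) ? node.name : [node.name];
  const match = names.find(
    (entry) => entry && typeof entry === 'object' &&
      entry['@type'] === 'PronounceableText' &&
      entry.speechToTextMarkup === 'SSML'
  );
  return match ? match.phoneticText : null;
}
```

The returned string can then be passed to a speech API in the same way as in the earlier use cases.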

6.5 Existing Work

6.6 Problems and Limitations

Not an established "type"/published schema

7. Use Case Ruby

7.1 Background and Current Practice

<ruby> annotations are short runs of text presented alongside base text, primarily used in East Asian typography as a guide for pronunciation or to include other annotations.

The ruby element guides pronunciation visually. This seems like a natural fit for text-to-speech.

7.2 Goal

Support use of SSML in HTML documents.
Offer visual pronunciation support.

7.3 Target Audience

AT and browser extensions
Search indexers

7.4 Implementation Options

ruby with microdata

Microdata can augment the ruby element and its descendants.

Example 21

<p>
  You say,
  <span itemscope="" itemtype="http://example.org/Pronunciation">
    <ruby itemprop="phoneme" content="pecan">
      pecan
      <rt itemprop="ph">pɪˈkɑːn</rt>
      <meta itemprop="alphabet" content="ipa">
    </ruby>.
  </span>
  I say,
  <span itemscope="" itemtype="http://example.org/Pronunciation">
    <ruby itemprop="phoneme" content="pecan">
      pe
      <rt itemprop="ph">ˈpi</rt>
      can
      <rt itemprop="ph">kæn</rt>
      <meta itemprop="alphabet" content="ipa">
    </ruby>.
  </span>
</p>
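
Once a client has read the microdata, it could assemble an SSML phoneme element from it. The sketch below assumes the ruby content has already been extracted into an object of base text and rt segments. That object shape and the function name rubyToPhoneme are assumptions made for illustration.

```javascript
// Hypothetical helper: build an SSML phoneme element from ruby microdata
// that has already been read into { alphabet, segments: [{ base, ph }] }.
function rubyToPhoneme({ alphabet, segments }) {
  const escapeXML = (value) => String(value)
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;');
  // Join the base text and the rt annotations in document order.
  // Syllable boundaries between segments are not preserved.
  const text = segments.map((segment) => segment.base).join('');
  const ph = segments.map((segment) => segment.ph).join('');
  return `<phoneme alphabet="${escapeXML(alphabet)}" ph="${escapeXML(ph)}">` +
    `${escapeXML(text)}</phoneme>`;
}
```

For the second ruby element in Example 21, the segments would be pe with ˈpi and can with kæn.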

7.5 Existing Work

7.6 Problems and Limitations

AT may process annotations as content
AT may "double read" words instead of choosing either the content or the annotation
Only offers support for a few SSML expressions
Difficult to reuse by reference