EPUB 3 Text-to-Speech Enhancements 1.0

W3C Working Group Note

This version:
https://www.w3.org/TR/2021/NOTE-epub-tts-10-20210914/
Latest published version:
https://www.w3.org/TR/epub-tts-10/
Latest editor's draft:
https://w3c.github.io/epub-specs/epub33/tts/
Previous version:
https://www.w3.org/TR/2021/NOTE-epub-tts-10-20210903/
Editor:
Matt Garrish (DAISY Consortium)
Participate:
GitHub w3c/epub-specs

Abstract

This document describes authoring features and reading system support for improving the voicing of EPUB 3 publications.

Status of This Document

This section describes the status of this document at the time of its publication. Other documents may supersede this document. A list of current W3C publications and the latest revision of this technical report can be found in the W3C technical reports index at https://www.w3.org/TR/.

This document was published by the EPUB 3 Working Group as a Working Group Note.

GitHub Issues are preferred for discussion of this specification. Alternatively, you can send comments to the mailing list at public-epub3@w3.org.

Publication as a Working Group Note does not imply endorsement by the W3C Membership.

This is a draft document and may be updated, replaced or obsoleted by other documents at any time. It is inappropriate to cite this document as other than work in progress.

This document was produced by a group operating under the W3C Patent Policy.

This document is governed by the 15 September 2020 W3C Process Document.

1. Introduction

1.1 Overview

Clear and accurate Text-to-Speech (TTS) rendering of publications is essential to their readability and comprehension. Unfortunately, the complexities of voicing natural languages and the limitations of built-in vocabularies in TTS engines often lead to incorrect and unintelligible voicing. Users either have to infer the correct meaning, when possible, or stop reading and have the garbled words spelled out. Anyone who has tried to read educational or instructional material using basic TTS playback will understand the frustration of this experience.

W3C has defined a variety of technologies to aid in improving the voice rendering of markup content: the Speech Synthesis Markup Language [SSML], pronunciation lexicons [PRONUNCIATION-LEXICON], and the CSS Speech module [CSS-SPEECH-1].

SSML and pronunciation lexicons provide enhanced speech rendering. Lexicons are like dictionaries of common terms a TTS engine can use, while SSML provides the ability to add individual voicing for specific phrases. EPUB Creators can use these technologies together or separately depending on the complexity of the text. Despite these advantages, the technologies have not been adapted for easy use within the XHTML and SVG formats that EPUB relies on. This document proposes an approach to enable their authoring and rendering in EPUB Content Documents.

This document also covers the use of CSS Speech for improved aural rendering in EPUB. CSS Speech covers a different domain than SSML and pronunciation lexicons. Instead of controlling the specific voicing of words and phrases, its properties allow EPUB Creators to control aspects of the aural playback itself: what text to render, at what volume, with what preferred voice, etc.

This document covers the use of these technologies for rendering by EPUB Reading Systems. Although it is anticipated that general assistive technologies such as screen readers could take advantage of the technologies, use by them is out of scope.

1.2 Background

This section is non-normative.

The EPUB Working Group of the International Digital Publishing Forum (IDPF) first defined a means of integrating the Speech Synthesis Markup Language [SSML] and pronunciation lexicons [PRONUNCIATION-LEXICON] in EPUB 3.0 [EPUBContentDocs-30] so that EPUB Creators could improve the rendering quality of text-to-speech (TTS) playback in Reading Systems. The ability to include cascading style sheets [CSS2] also allowed EPUB Creators to access the in-development speech properties of the CSS Speech module [CSS-SPEECH-1].

Although there has been some authoring uptake of these technologies, support in Reading Systems has yet to materialize to a level where these technologies are considered stable. Consequently, these technologies are now published as a W3C Working Group Note.

EPUB Creators can continue to use these technologies in their publications, as the move to a Note does not change their validity or affect backward compatibility. Developers of Reading Systems that support TTS playback are also strongly encouraged to implement support. The Working Group will look at standardizing any of the technologies that meet support requirements in future revisions of EPUB 3.

Note

The Specification for Spoken Presentation in HTML [Spoken-HTML] is another initiative in W3C to bring SSML to HTML. It is still too early to determine what effect, if any, it will have on this document. The Working Group will monitor the work and future updates to this Note will reflect any impact it has on Text-to-Speech rendering in EPUB.

1.3 Terminology

This specification uses terminology defined in EPUB 3.3 [EPUB-33]. These terms appear capitalized wherever used.

Only the first instance of a term in a section links to its definition.

In addition, this specification defines the following terms:

Text-to-Speech

The rendering of the textual content of an EPUB Publication by a Reading System as artificial human speech using a synthesized voice.

1.4 Conformance

As well as sections marked as non-normative, all authoring guidelines, diagrams, examples, and notes in this specification are non-normative. Everything else in this specification is normative.

The key words MAY, MUST, MUST NOT, SHOULD, and SHOULD NOT in this document are to be interpreted as described in BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all capitals, as shown here.

2. SSML Attributes

2.1 Introduction

This section is non-normative.

The W3C Speech Synthesis Markup Language [SSML] is a language used for assisting Text-to-Speech (TTS) engines in generating synthetic speech. Although SSML is designed as a standalone document type, it also defines semantics suitable for use within other markup languages.

This specification recasts the [SSML] phoneme element as two attributes — ssml:ph and ssml:alphabet — and makes them available within EPUB Content Documents.

The attributes allow EPUB Creators to specify the proper phonetic pronunciation for uncommon terms that a TTS engine is likely to mispronounce, as well as to disambiguate heteronyms.
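The relationship between the two attributes and the [SSML] phoneme element can be illustrated with a short, non-normative sketch. It assumes Python's standard xml.etree.ElementTree module, the namespace URI declared by SSML 1.1, and an illustrative IPA transcription of the past tense of "read"; the function name to_phoneme is hypothetical and is not defined by this document.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr

# Namespace URI of SSML 1.1; the "ssml" prefix itself is only a convention.
SSML_NS = "http://www.w3.org/2001/10/synthesis"

# Illustrative XHTML fragment: a heteronym disambiguated with ssml:ph.
XHTML = (
    '<p xmlns="http://www.w3.org/1999/xhtml" '
    'xmlns:ssml="http://www.w3.org/2001/10/synthesis">'
    'I <span ssml:ph="ɹɛd" ssml:alphabet="ipa">read</span> that book last year.</p>'
)


def to_phoneme(element):
    """Recast an element's ssml:ph / ssml:alphabet pair as an SSML <phoneme>."""
    ph = element.get(f"{{{SSML_NS}}}ph")
    alphabet = element.get(f"{{{SSML_NS}}}alphabet")
    text = "".join(element.itertext())
    attrs = f"ph={quoteattr(ph)}"
    if alphabet:
        attrs = f"alphabet={quoteattr(alphabet)} " + attrs
    return f"<phoneme {attrs}>{escape(text)}</phoneme>"


paragraph = ET.fromstring(XHTML)
span = paragraph[0]  # the only child element of the paragraph
print(to_phoneme(span))
# prints: <phoneme alphabet="ipa" ph="ɹɛd">read</phoneme>
```

A Reading System whose engine accepts SSML input could perform a conversion along these lines; engines that accept only plain text would need a different strategy.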

2.2 The ssml:ph attribute

The ssml:ph attribute specifies a phonemic/phonetic pronunciation of the text represented by its carrying element.

Attribute Name

ph

Namespace

http://www.w3.org/2001/10/synthesis

Usage

EPUB Creators MAY specify the attribute on any element in EPUB Content Documents with which they can logically associate a phonetic equivalent (i.e., that has descendant text content that a Text-to-Speech engine would otherwise render).

EPUB Creators MUST NOT specify the attribute on a descendant of an element that already carries this attribute.

Value

A phonemic/phonetic expression, syntactically valid with respect to the phonemic/phonetic alphabet used.

The ssml:ph attribute inherits the authoring requirements of the [SSML] phoneme element's ph attribute.

When the ssml:ph attribute appears on an element that has text node descendants, the corresponding document text to which the pronunciation applies is the string that results from concatenating the descendant text nodes, in document order. The specified phonetic pronunciation must therefore logically match the element's textual data in its entirety (i.e., not just an isolated part of its content).
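The following non-normative sketch shows how the text covered by an ssml:ph attribute could be computed when the carrying element has nested markup. It assumes Python's xml.etree.ElementTree; the function name phonetic_targets and the sample transcription are illustrative only.

```python
import xml.etree.ElementTree as ET

SSML_NS = "http://www.w3.org/2001/10/synthesis"

# The span contains an <em> child, so its text is split over three text nodes.
XHTML = (
    '<p xmlns="http://www.w3.org/1999/xhtml" '
    'xmlns:ssml="http://www.w3.org/2001/10/synthesis">'
    'The <span ssml:ph="ˈkɑːbən daɪˈɒksaɪd" ssml:alphabet="ipa">'
    'carbon <em>di</em>oxide</span> level rose.</p>'
)


def phonetic_targets(root):
    """Yield (text, pronunciation) for every element carrying ssml:ph."""
    for element in root.iter():
        ph = element.get(f"{{{SSML_NS}}}ph")
        if ph is not None:
            # itertext() walks descendant text nodes in document order.
            yield "".join(element.itertext()), ph


targets = list(phonetic_targets(ET.fromstring(XHTML)))
print(targets)
```

The pronunciation is paired with the whole string "carbon dioxide", which is why the attribute value has to cover the element's entire text rather than one part of it.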

EPUB Creators SHOULD NOT use the ssml:ph attribute on elements without text content that a Text-to-Speech engine would normally render (e.g., on empty div or span elements). The attribute is not intended to add voicing that exists only for TTS playback, and Reading Systems are expected to ignore the attribute if it does not replace text they would normally render.

Note

The ssml:ph attribute does not replace attribute values that carry additional textual information (e.g., alt [HTML] and aria-label [WAI-ARIA]) or link additional textual information (e.g., aria-describedby [WAI-ARIA]).

Similarly, EPUB Creators SHOULD NOT add empty ssml:ph attributes to try to suppress the rendering of text. Reading Systems are expected to ignore empty attributes. (See the aria-hidden attribute [WAI-ARIA] for specifying that content is only for visual rendering.)
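The authoring restrictions in this section (no nested ssml:ph attributes, no ssml:ph on elements without text, no empty values) lend themselves to automated checking. The following is a minimal, non-normative sketch of such a check, assuming Python's xml.etree.ElementTree; the function name check_ph_usage and the message strings are hypothetical and do not correspond to any existing validator.

```python
import xml.etree.ElementTree as ET

SSML_NS = "http://www.w3.org/2001/10/synthesis"
PH = f"{{{SSML_NS}}}ph"

# Deliberately problematic sample markup.
XHTML = (
    '<div xmlns="http://www.w3.org/1999/xhtml" '
    'xmlns:ssml="http://www.w3.org/2001/10/synthesis">'
    '<p ssml:ph="a"><span ssml:ph="b">nested</span></p>'
    '<span ssml:ph="c"></span>'
    '<span ssml:ph="">hidden?</span>'
    '</div>'
)


def check_ph_usage(element, inside_ph=False, problems=None):
    """Collect illustrative lint messages for ssml:ph authoring issues."""
    problems = [] if problems is None else problems
    has_ph = PH in element.attrib
    if has_ph:
        if inside_ph:
            problems.append("nested ssml:ph")
        if not element.get(PH).strip():
            problems.append("empty ssml:ph")
        if not "".join(element.itertext()).strip():
            problems.append("ssml:ph without text content")
    for child in element:
        check_ph_usage(child, inside_ph or has_ph, problems)
    return problems


problems = check_ph_usage(ET.fromstring(XHTML))
print(problems)
```

The sketch does not check whether a value is syntactically valid for its alphabet, which would require alphabet-specific rules.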

2.3 The ssml:alphabet attribute

The ssml:alphabet attribute specifies which phonemic/phonetic pronunciation alphabet is used in the value of the ssml:ph attribute.

Attribute Name

alphabet

Namespace

http://www.w3.org/2001/10/synthesis

Usage

EPUB Creators MAY specify the attribute on any element in an EPUB Content Document that can contain descendant text content.

Value

The name of the pronunciation alphabet used to express the value of the ssml:ph attribute.

The ssml:alphabet attribute inherits the authoring requirements of the [SSML] phoneme element's alphabet attribute.

The value of the ssml:alphabet attribute is inherited in the document tree. The pronunciation alphabet used for each ssml:ph attribute value is determined by locating the first occurrence of the ssml:alphabet attribute, starting with the element on which the ssml:ph attribute appears and then checking each ancestor element in turn, nearest first.

EPUB Creators SHOULD ensure that an alphabet is defined in scope for all phonemes expressed in ssml:ph attributes. Interoperability of playback cannot be guaranteed in the absence of a declaration — Reading Systems may apply a default alphabet, for example, or may not voice the phoneme.
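The lookup of the alphabet in scope can be sketched as follows. This is a non-normative illustration that assumes Python's xml.etree.ElementTree; the function name resolve_alphabets is hypothetical, and returning None when no declaration is in scope is a choice made for this sketch, not a behaviour this document requires.

```python
import xml.etree.ElementTree as ET

SSML_NS = "http://www.w3.org/2001/10/synthesis"
PH = f"{{{SSML_NS}}}ph"
ALPHABET = f"{{{SSML_NS}}}alphabet"

# A document-wide default on the root, overridden on the second paragraph.
XHTML = (
    '<html xmlns="http://www.w3.org/1999/xhtml" '
    'xmlns:ssml="http://www.w3.org/2001/10/synthesis" ssml:alphabet="ipa">'
    '<body>'
    '<p><span ssml:ph="təˈmɑːtəʊ">tomato</span></p>'
    '<p ssml:alphabet="x-sampa"><span ssml:ph="t@&quot;mA:t@U">tomato</span></p>'
    '</body></html>'
)


def resolve_alphabets(root):
    """Return (ph, alphabet) pairs; alphabet is None when none is in scope."""
    # ElementTree has no parent pointers, so build a child -> parent map.
    parents = {child: parent for parent in root.iter() for child in parent}
    resolved = []
    for element in root.iter():
        if PH not in element.attrib:
            continue
        node = element
        # Check the carrying element first, then each ancestor, nearest first.
        while node is not None and ALPHABET not in node.attrib:
            node = parents.get(node)
        alphabet = node.get(ALPHABET) if node is not None else None
        resolved.append((element.get(PH), alphabet))
    return resolved


resolved = resolve_alphabets(ET.fromstring(XHTML))
print(resolved)
```

Declaring the alphabet once on the root element, as in the sample, keeps every ssml:ph value in scope of a declaration while still allowing local overrides.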

Note

Although the [SSML] specification refers to a registry of alphabets, one has not been published. As the charter of the W3C Voice Browser Working Group has expired, the Working Group does not anticipate the publication of such a registry. EPUB Creators should therefore consult Reading System support documentation to determine which alphabet values are supported. Some common alphabets include x-JEITA (also x-JEITA-IT-4002 and x-JEITA-IT-4006) and x-sampa.

3. Pronunciation Lexicons

3.1 Introduction

This section is non-normative.

The W3C Pronunciation Lexicon Specification (PLS) [PRONUNCIATION-LEXICON] defines syntax and semantics for XML-based pronunciation lexicons to be used by Automatic Speech Recognition and Text-to-Speech (TTS) engines.

Pronunciation lexicons allow EPUB Creators to define a single global phonetic pronunciation that Reading Systems can use for all instances of a term instead of having to tag every instance using the SSML attributes. A lexicon is a much more efficient way of defining pronunciations for words that have only a single pronunciation, or for which one pronunciation is predominant.
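As a non-normative illustration, the sketch below embeds a small lexicon and reduces it to a lookup table from written form to pronunciation. It assumes Python's xml.etree.ElementTree and the element names and namespace defined by [PRONUNCIATION-LEXICON]; the function name load_lexicon is hypothetical, and the sketch ignores features such as alias elements and multiple phoneme alternatives.

```python
import xml.etree.ElementTree as ET

PLS_NS = "http://www.w3.org/2005/01/pronunciation-lexicon"

# Illustrative lexicon: two spellings that share one pronunciation.
LEXICON = """<lexicon version="1.0"
    xmlns="http://www.w3.org/2005/01/pronunciation-lexicon"
    alphabet="ipa" xml:lang="en">
  <lexeme>
    <grapheme>judgment</grapheme>
    <grapheme>judgement</grapheme>
    <phoneme>ˈdʒʌdʒ.mənt</phoneme>
  </lexeme>
</lexicon>"""


def load_lexicon(pls_source):
    """Map each grapheme to the first phoneme of its lexeme."""
    root = ET.fromstring(pls_source)
    ns = {"pls": PLS_NS}
    entries = {}
    for lexeme in root.findall("pls:lexeme", ns):
        phoneme = lexeme.findtext("pls:phoneme", namespaces=ns)
        if phoneme is None:
            continue  # e.g., lexemes that only carry an alias
        for grapheme in lexeme.findall("pls:grapheme", ns):
            entries[grapheme.text] = phoneme
    return entries


lexicon = load_lexicon(LEXICON)
print(lexicon)
```

One lexeme covers every occurrence of both spellings in the publication, which is the efficiency gain over tagging each instance individually.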

EPUB Creators can use the [HTML] link element and [SVG2] link element to associate one or more lexicons with their respective EPUB Content Document type. When Reading Systems process the documents, they can identify the linked lexicons and apply them during Text-to-Speech playback.

3.2 Lexicon Conformance

A pronunciation lexicon:

Note

An informative schema for validating lexicons is available at https://www.w3.org/TR/2008/REC-pronunciation-lexicon-20081014/pls.rng [PRONUNCIATION-LEXICON].

3.3 Associating with EPUB Content Documents

EPUB Creators MAY associate zero or more pronunciation lexicons [PRONUNCIATION-LEXICON] with an EPUB Content Document.

To associate a pronunciation lexicon with an XHTML Content Document, EPUB Creators MUST use the [HTML] link element. Similarly, to associate a pronunciation lexicon with an SVG Content Document, EPUB Creators MUST use the [SVG2] link element.

For both types of EPUB Content Document, the link element MUST have its rel attribute set to "pronunciation" and its type attribute set to the media type "application/pls+xml".

EPUB Creators SHOULD specify the link element hreflang attribute on each link, and its value MUST match the language for which the pronunciation lexicon is relevant [PRONUNCIATION-LEXICON] when specified.
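The following non-normative sketch shows an XHTML Content Document that links two lexicons, together with one way a Reading System might discover them. It assumes Python's xml.etree.ElementTree; the file paths and the function name pronunciation_lexicons are hypothetical.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"

# Illustrative document head with one lexicon per language.
XHTML = """<html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en">
  <head>
    <title>Chapter 1</title>
    <link rel="stylesheet" type="text/css" href="css/book.css"/>
    <link rel="pronunciation" type="application/pls+xml"
          hreflang="en" href="lexicons/en.pls"/>
    <link rel="pronunciation" type="application/pls+xml"
          hreflang="fr" href="lexicons/fr.pls"/>
  </head>
  <body><p>Bonjour, reader.</p></body>
</html>"""


def pronunciation_lexicons(xhtml_source):
    """Return (hreflang, href) for each linked pronunciation lexicon."""
    root = ET.fromstring(xhtml_source)
    found = []
    for link in root.iter(f"{{{XHTML_NS}}}link"):
        rel_tokens = (link.get("rel") or "").lower().split()
        is_pls = link.get("type") == "application/pls+xml"
        if "pronunciation" in rel_tokens and is_pls:
            found.append((link.get("hreflang"), link.get("href")))
    return found


links = pronunciation_lexicons(XHTML)
print(links)
```

The hreflang value lets a Reading System pick the lexicon that matches the language of the text being voiced; the stylesheet link is skipped because its rel and type values do not match.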

4. CSS Speech

The CSS Speech [CSS-SPEECH-1] module defines properties that allow EPUB Creators to declaratively control the aural rendering of EPUB Content Documents. It includes properties for specifying the preferred Text-to-Speech voice, the volume level, and pauses and cues to perform when encountering elements.

As EPUB Content Documents support the use of cascading style sheets [CSS2], EPUB Creators MAY use CSS Speech [CSS-SPEECH-1] properties in their style sheet definitions.
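To give a sense of what such style sheet definitions look like, the non-normative sketch below builds a few rules from a table of declarations. The property names and values are taken from [CSS-SPEECH-1]; the selectors, the class names, and the helper to_css are hypothetical and exist only for this illustration.

```python
# Hypothetical selectors mapped to CSS Speech declarations.
SPEECH_RULES = {
    "h1": {
        "voice-family": "female",
        "pause-before": "strong",
        "pause-after": "strong",
    },
    "em": {"voice-stress": "strong"},
    ".isbn": {"speak-as": "digits"},      # read the number digit by digit
    ".pagebreak": {"speak": "never"},     # do not voice page-break markers
}


def to_css(rules):
    """Serialize a selector -> declarations mapping as style sheet text."""
    lines = []
    for selector, declarations in rules.items():
        lines.append(f"{selector} {{")
        for prop, value in declarations.items():
            lines.append(f"  {prop}: {value};")
        lines.append("}")
    return "\n".join(lines)


css = to_css(SPEECH_RULES)
print(css)
```

The resulting text can be placed in any style sheet referenced from an EPUB Content Document; whether a given Reading System honours the properties depends on its CSS Speech support.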

5. Reading System Support

5.1 Introduction

Reading Systems may implement Text-to-Speech playback in different ways depending on the type of engine they use — one might only feed the text content of the document to the engine, for example, while another could support full markup. This document tries to provide flexibility in its requirements to allow for these differences. The only requirement is that the correct rendering behaviour result.

Although this document frames the enhancements in the context of a Reading System with built-in Text-to-Speech rendering capabilities, it is anticipated that any application or assistive technology that can access the markup of an EPUB Publication will be able to use these features to provide improved voice rendering. Ensuring the technologies work with these applications is outside the scope of this work, however.

5.2 Conformance

Reading Systems with Text-to-Speech (TTS) capabilities SHOULD support SSML attributes, pronunciation lexicons and CSS Speech as follows:

SSML

Reading Systems that support SSML:

Pronunciation Lexicons

Reading Systems that support pronunciation lexicons:

Note

It is not required that the Reading System use a Text-to-Speech engine that supports pronunciation lexicons so long as the lexemes are processed and applied correctly. A Reading System might, for example, transform the lexicon into an alternative dictionary format its TTS engine supports.

SSML and Pronunciation Lexicons

Reading Systems that support SSML and pronunciation lexicons:

  • MUST let any pronunciation instructions provided via the ssml:ph attribute take precedence in cases where a grapheme element [PRONUNCIATION-LEXICON] matches a text node of an element that carries the ssml:ph attribute.
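The precedence rule can be sketched as follows. This non-normative illustration assumes Python's xml.etree.ElementTree and a lexicon that has already been reduced to a mapping from grapheme to phoneme; the function name pronunciation_for and the sample data are hypothetical. For brevity, the sketch only inspects the element itself and does not look for an ssml:ph attribute on ancestors.

```python
import xml.etree.ElementTree as ET

SSML_NS = "http://www.w3.org/2001/10/synthesis"
PH = f"{{{SSML_NS}}}ph"

# Hypothetical lexicon content: the predominant (present tense) pronunciation.
LEXICON = {"read": "ɹiːd"}

# The second span overrides the lexicon for a past-tense occurrence.
XHTML = (
    '<p xmlns="http://www.w3.org/1999/xhtml" '
    'xmlns:ssml="http://www.w3.org/2001/10/synthesis" ssml:alphabet="ipa">'
    '<span>read</span> <span ssml:ph="ɹɛd">read</span></p>'
)


def pronunciation_for(element, lexicon):
    """Prefer the element's ssml:ph value over a matching lexicon grapheme."""
    ph = element.get(PH)
    if ph:  # a non-empty ssml:ph value takes precedence
        return ph
    text = "".join(element.itertext())
    return lexicon.get(text)  # None when the lexicon has no matching grapheme


spans = list(ET.fromstring(XHTML))
choices = [pronunciation_for(span, LEXICON) for span in spans]
print(choices)
```

This mirrors the intended division of labour: the lexicon supplies the predominant pronunciation, and ssml:ph handles the individual exceptions.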

CSS Speech

This document adds no additional requirements for Reading System support to those defined in [CSS-SPEECH-1].

A. Change Log

This section is non-normative.

Note that this change log only identifies substantive changes since EPUB Content Documents 3.2 — those that affect conformance or are similarly noteworthy.

For a list of all issues addressed during the revision, refer to the Working Group's issue tracker.

B. Acknowledgements

This section is non-normative.

The following members of the EPUB 3 Working Group contributed to the development of this specification:

C. References

C.1 Normative references

[BCP47]
Tags for Identifying Languages. A. Phillips, Ed.; M. Davis, Ed. IETF. September 2009. Best Current Practice. URL: https://www.rfc-editor.org/rfc/rfc5646
[CSS-SPEECH-1]
CSS Speech Module. Daniel Weck. W3C. 10 March 2020. W3C Candidate Recommendation. URL: https://www.w3.org/TR/css-speech-1/
[CSS2]
CSS 2. W3C. W3C Recommendation. URL: https://www.w3.org/TR/CSS2/
[EPUB-33]
EPUB 3.3. Matt Garrish; Ivan Herman; Dave Cramer; Garth Conboy; Marisa DeMeglio; Daniel Weck. W3C. 9 September 2021. W3C Working Draft. URL: https://www.w3.org/TR/epub-33/
[HTML]
HTML Standard. Anne van Kesteren; Domenic Denicola; Ian Hickson; Philip Jägenstedt; Simon Pieters. WHATWG. Living Standard. URL: https://html.spec.whatwg.org/multipage/
[INFRA]
Infra Standard. Anne van Kesteren; Domenic Denicola. WHATWG. Living Standard. URL: https://infra.spec.whatwg.org/
[PRONUNCIATION-LEXICON]
Pronunciation Lexicon Specification (PLS) Version 1.0. Paolo Baggia. W3C. 14 October 2008. W3C Recommendation. URL: https://www.w3.org/TR/pronunciation-lexicon/
[RFC2119]
Key words for use in RFCs to Indicate Requirement Levels. S. Bradner. IETF. March 1997. Best Current Practice. URL: https://www.rfc-editor.org/rfc/rfc2119
[RFC8174]
Ambiguity of Uppercase vs Lowercase in RFC 2119 Key Words. B. Leiba. IETF. May 2017. Best Current Practice. URL: https://www.rfc-editor.org/rfc/rfc8174
[SSML]
Speech Synthesis Markup Language (SSML) Version 1.1. Daniel Burnett; Zhi Wei Shuang. W3C. 7 September 2010. W3C Recommendation. URL: https://www.w3.org/TR/speech-synthesis11/
[SVG2]
Scalable Vector Graphics (SVG) 2. Amelia Bellamy-Royds; Bogdan Brinza; Chris Lilley; Dirk Schulze; David Storey; Eric Willigers. W3C. 4 October 2018. W3C Candidate Recommendation. URL: https://www.w3.org/TR/SVG2/

C.2 Informative references

[EPUBContentDocs-30]
EPUB Content Documents 3.0. Markus Gylling; William McCoy; Elika J. Etemad; Matt Garrish. IDPF. 11 October 2011. URL: http://idpf.org/epub/30/spec/epub30-contentdocs-20111011.html
[Spoken-HTML]
Specification for Spoken Presentation in HTML. Irfan Ali; Markku Hakkinen; Paul Grenier; Ruoxi Ran. W3C. 18 May 2021. W3C Working Draft. URL: https://www.w3.org/TR/spoken-html/
[WAI-ARIA]
Accessible Rich Internet Applications (WAI-ARIA) 1.0. James Craig; Michael Cooper et al. W3C. 20 March 2014. W3C Recommendation. URL: https://www.w3.org/TR/wai-aria/