EPUB 3 Text-to-Speech Enhancements 1.0

1. Introduction

1.1 Overview

The need for clear and accurate Text-to-Speech (TTS) rendering of publications is imperative for their readability and comprehension. Unfortunately, the complexities of voicing natural languages and the limitations of built-in vocabularies in TTS engines often lead to incorrect and unintelligible voicing. Users either have to infer the correct meaning, when possible, or stop reading and have the garbled words spelled out. Anyone who has tried to read educational or instructional material using basic TTS playback will understand the frustration of this experience.

W3C has defined a variety of technologies to aid in improving the voice rendering of markup content: the Speech Synthesis Markup Language [SSML], pronunciation lexicons [PRONUNCIATION-LEXICON], and the CSS Speech module [CSS-SPEECH-1].

SSML and pronunciation lexicons provide enhanced speech rendering. Lexicons are like dictionaries of common terms a TTS engine can use, while SSML provides the ability to add individual voicing for specific phrases. EPUB Creators can use these technologies together or separately depending on the complexity of the text. Despite these advantages, the technologies have not been adapted for easy use within the XHTML and SVG formats that EPUB relies on. This document proposes an approach to enable their authoring and rendering in EPUB Content Documents.

This document also covers the use of CSS Speech for improved aural rendering in EPUB. CSS Speech covers a different domain than SSML and pronunciation lexicons. Instead of controlling the specific voicing of words and phrases, these properties allow EPUB Creators to control aspects of the aural playback itself: what text to render, at what volume, with what preferred voice, etc.

This document covers the use of these technologies for rendering by EPUB Reading Systems. Although it is anticipated that general assistive technologies such as screen readers could take advantage of the technologies, use by them is out of scope.

1.2 Background

This section is non-normative.

The EPUB Working Group of the International Digital Publishing Forum (IDPF) first defined a means of integrating the Speech Synthesis Markup Language [SSML] and pronunciation lexicons [PRONUNCIATION-LEXICON] in EPUB 3.0 [EPUBContentDocs-30] so that EPUB Creators could improve the rendering quality of text-to-speech (TTS) playback in Reading Systems. The ability to include cascading style sheets [CSS2] also allowed EPUB Creators to access the in-development speech properties of the CSS Speech module [CSS-SPEECH-1].

Although there has been some authoring uptake of these technologies, support in Reading Systems has yet to materialize to a level where these technologies are considered stable. Consequently, these technologies are now published as a W3C Working Group Note.

EPUB Creators can continue to use these technologies in their publications, as the move to a Note does not change their validity or affect backward compatibility. Developers of Reading Systems that support TTS playback are also strongly encouraged to implement support. The Working Group will look at standardizing any of the technologies that meet support requirements in future revisions of EPUB 3.

Note

The Specification for Spoken Presentation in HTML [Spoken-HTML] is another initiative in W3C to bring SSML to HTML. It is still too early to determine what effect, if any, it will have on this document. The Working Group will monitor the work and future updates to this Note will reflect any impact it has on Text-to-Speech rendering in EPUB.

1.3 Terminology

This specification uses terminology defined in EPUB 3.3 [EPUB-33]. These terms appear capitalized wherever used.

Only the first instance of a term in a section links to its definition.

In addition, this specification defines the following terms:

Text-to-Speech: The rendering of the textual content of an EPUB Publication by a Reading System as artificial human speech using a synthesized voice.

1.4 Conformance

As well as sections marked as non-normative, all authoring guidelines, diagrams, examples, and notes in this specification are non-normative. Everything else in this specification is normative.

The key words MAY, MUST, MUST NOT, SHOULD, and SHOULD NOT in this document are to be interpreted as described in BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all capitals, as shown here.

2. SSML Attributes

2.1 Introduction

This section is non-normative.

The W3C Speech Synthesis Markup Language [SSML] is a language used for assisting Text-to-Speech (TTS) engines in generating synthetic speech. Although SSML is designed as a standalone document type, it also defines semantics suitable for use within other markup languages.

This specification recasts the [SSML] phoneme element as two attributes — ssml:ph and ssml:alphabet — and makes them available within EPUB Content Documents.

The attributes allow EPUB Creators to specify the proper phonetic pronunciation for uncommon terms that a TTS engine is likely to mispronounce, as well as to disambiguate heteronyms.

2.2 The `ssml:ph` attribute

The ssml:ph attribute specifies a phonemic/phonetic pronunciation of the text represented by its carrying element.

Attribute Name: ph
Namespace: http://www.w3.org/2001/10/synthesis
Usage: EPUB Creators MAY specify the attribute on any element in EPUB Content Documents with which they can logically associate a phonetic equivalent (i.e., that has descendant text content that a Text-to-Speech engine would otherwise render). EPUB Creators MUST NOT specify the attribute on a descendant of an element that already carries this attribute.
Value: A phonemic/phonetic expression, syntactically valid with respect to the phonemic/phonetic alphabet used.

The ssml:ph attribute inherits the authoring requirements of the [SSML] phoneme element's ph attribute.

When the ssml:ph attribute appears on an element that has text node descendants, the corresponding document text to which the pronunciation applies is the string that results from concatenating the descendant text nodes, in document order. The specified phonetic pronunciation must therefore logically match the element's textual data in its entirety (i.e., not just an isolated part of its content).
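
The concatenation rule, together with the restriction on nesting, can be sketched in a few lines of Python. This is only an illustration using the standard library's ElementTree; the function name `ph_spans` and the sample markup (including the deliberately nested `ssml:ph="pʌ"`) are assumptions made for the example, not part of this document.

```python
import xml.etree.ElementTree as ET

SSML_NS = "http://www.w3.org/2001/10/synthesis"
PH = f"{{{SSML_NS}}}ph"  # ElementTree spells namespaced names as {uri}local


def ph_spans(root, inside_ph=False):
    """Yield (covered_text, ph_value, is_nested) for each ssml:ph element.

    covered_text is the concatenation of all descendant text nodes in
    document order; is_nested is True when an ancestor also carries ssml:ph.
    """
    has_ph = PH in root.attrib
    if has_ph:
        yield "".join(root.itertext()), root.get(PH), inside_ph
    for child in root:
        yield from ph_spans(child, inside_ph or has_ph)


# Hypothetical markup: the inner <b> breaks the nesting rule on purpose.
doc = ET.fromstring(
    f'<h1 xmlns:ssml="{SSML_NS}">'
    '<span ssml:ph="ipʌb">E<b ssml:ph="pʌ">PU</b>B</span> 3.3</h1>'
)
spans = list(ph_spans(doc))
for covered_text, ph_value, is_nested in spans:
    print(repr(covered_text), ph_value, "NESTED" if is_nested else "ok")
```

The outer span covers the whole string "EPUB" even though the text is split across three text nodes, which is why the pronunciation has to match the element's text in its entirety.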

EPUB Creators SHOULD NOT use the ssml:ph attribute on elements without text content that a Text-to-Speech engine would normally render (e.g., on empty div or span elements). The attribute is not intended to add additional voicing only for TTS playback, and Reading Systems are expected to ignore the attribute if it does not replace text they would normally render.

Note

The ssml:ph attribute does not replace attribute values that carry additional textual information (e.g., alt [HTML] and aria-label [WAI-ARIA]) or link additional textual information (e.g., aria-describedby [WAI-ARIA]).

Similarly, EPUB Creators SHOULD NOT add empty ssml:ph attributes to try to suppress the rendering of text. Reading Systems are expected to ignore empty attributes. (See the aria-hidden attribute [WAI-ARIA] for specifying that content is only for visual rendering.)

Example 1

The following example shows the pronunciation for EPUB added to HTML markup.

<html …
      xmlns:ssml="http://www.w3.org/2001/10/synthesis"
      ssml:alphabet="ipa">
   …
   <body>
      <h1><span ssml:ph="ipʌb">EPUB</span> 3.3</h1>
      …
   </body>
</html>

Example 2

The following example shows the pronunciation for EPUB added to SVG markup.

<svg …
     xmlns:ssml="http://www.w3.org/2001/10/synthesis"
     ssml:alphabet="ipa">
   <title><tspan ssml:ph="ipʌb">EPUB</tspan> 3 … </title>
   …
</svg>

2.3 The `ssml:alphabet` attribute

The ssml:alphabet attribute specifies which phonemic/phonetic pronunciation alphabet is used in the value of the ssml:ph attribute.

Attribute Name: alphabet
Namespace: http://www.w3.org/2001/10/synthesis
Usage: EPUB Creators MAY specify the attribute on any element in an EPUB Content Document that can contain descendant text content.
Value: The name of the pronunciation alphabet used to express the value of the ssml:ph attribute.

The ssml:alphabet attribute inherits the authoring requirements of the [SSML] phoneme element's alphabet attribute.

The value of the ssml:alphabet attribute is inherited in the document tree. The pronunciation alphabet used for each ssml:ph attribute value is determined by locating the first occurrence of the ssml:alphabet attribute starting with the element on which the ssml:ph attribute appears, followed by the nearest ancestor element.
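
One way to picture this lookup is a tree walk that carries the alphabet in scope down to each element. The following Python sketch assumes ElementTree parsing; the function name `resolve_alphabets` and the `placeholder` phoneme value are hypothetical and exist only for the illustration.

```python
import xml.etree.ElementTree as ET

SSML_NS = "http://www.w3.org/2001/10/synthesis"
PH = f"{{{SSML_NS}}}ph"
ALPHABET = f"{{{SSML_NS}}}alphabet"


def resolve_alphabets(root):
    """Return (ph_value, alphabet_in_scope) for every ssml:ph element.

    alphabet_in_scope is None when no declaration is found on the element
    or on any of its ancestors.
    """
    results = []

    def walk(element, inherited):
        # A declaration on the element itself wins over the inherited one.
        in_scope = element.get(ALPHABET, inherited)
        if PH in element.attrib:
            results.append((element.get(PH), in_scope))
        for child in element:
            walk(child, in_scope)

    walk(root, None)
    return results


doc = ET.fromstring(
    f'<html xmlns:ssml="{SSML_NS}" ssml:alphabet="x-JEITA"><body>'
    '<p><span ssml:alphabet="ipa" ssml:ph="ipʌb">EPUB</span></p>'
    # "placeholder" stands in for a real x-JEITA string.
    '<p><span ssml:ph="placeholder">EPUB</span></p>'
    '</body></html>'
)
resolved = resolve_alphabets(doc)
print(resolved)
```

A result of None for the alphabet corresponds to the case where no declaration is in scope, which a checking tool might report to the EPUB Creator.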

EPUB Creators SHOULD ensure that an alphabet is defined in scope for all phonemes expressed in ssml:ph attributes. Interoperability of playback cannot be guaranteed in the absence of a declaration — Reading Systems may apply a default alphabet, for example, or may not voice the phoneme.

Example 3

The following example shows a global declaration for the x-JEITA alphabet on the root html element. It is overridden in the body to switch to IPA.

<html … 
      xmlns:ssml="http://www.w3.org/2001/10/synthesis"
      ssml:alphabet="x-JEITA">
   …
   <body>
   	…
   	   <p><span ssml:alphabet="ipa" ssml:ph="ipʌb">EPUB</span> is an …</p>
   	…
   </body>
</html>

Example 4

The following example shows a global declaration for the x-sampa alphabet on the root svg element.

<svg …
      xmlns:ssml="http://www.w3.org/2001/10/synthesis"
      ssml:alphabet="x-sampa">
   <title><tspan ssml:ph="ipVb">EPUB</tspan> Adoption Chart</title>
   …
</svg>

Note

Although the [SSML] specification refers to a registry of alphabets, one has not been published. As the charter of the W3C Voice Browser Working Group has expired, the Working Group does not anticipate the publication of such a registry. EPUB Creators therefore should reference Reading System support documentation to determine what alphabet values they support. Some common alphabets include x-JEITA (also x-JEITA-IT-4002 and x-JEITA-IT-4006) and x-sampa.

3. Pronunciation Lexicons

3.1 Introduction

This section is non-normative.

The W3C Pronunciation Lexicon Specification (PLS) [PRONUNCIATION-LEXICON] defines syntax and semantics for XML-based pronunciation lexicons to be used by Automatic Speech Recognition and Text-to-Speech (TTS) engines.

Pronunciation lexicons allow EPUB Creators to define a single global phonetic pronunciation that Reading Systems can use for all instances of a term instead of having to tag every instance using the SSML attributes. It is a much more efficient way of defining pronunciations for words with only a single pronunciation, or where a particular pronunciation is predominant.

EPUB Creators can use the [HTML] link element and [SVG2] link element to associate one or more lexicons with their respective EPUB Content Document type. When Reading Systems process the documents, they can identify the linked lexicons and use them to improve text-to-speech playback.

3.2 Lexicon Conformance

A pronunciation lexicon:

MUST meet the conformance constraints for XML documents defined in XML Conformance [EPUB-33].
MUST be valid to the grammar defined in [PRONUNCIATION-LEXICON].

Note

An informative schema for validating lexicons is available at https://www.w3.org/TR/2008/REC-pronunciation-lexicon-20081014/pls.rng [PRONUNCIATION-LEXICON].

Example 5

The following example shows a pronunciation lexicon for English.

<lexicon
     version="1.0"
     alphabet="ipa"
     xml:lang="en"
     xmlns="http://www.w3.org/2005/01/pronunciation-lexicon">
   <lexeme>
      <grapheme>EPUB</grapheme>
      <phoneme>ipʌb</phoneme>
   </lexeme>
   …
</lexicon>
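
As a rough illustration of how a lexicon like this might be read, the Python sketch below loads the lexemes into a dictionary. It is a simplification under stated assumptions: `load_lexicon` is a hypothetical helper, only the first phoneme of each lexeme is kept, and lexemes without a phoneme (for example, ones that only carry an alias) are skipped.

```python
import xml.etree.ElementTree as ET

PLS_NS = "http://www.w3.org/2005/01/pronunciation-lexicon"
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

PLS_SOURCE = f"""
<lexicon version="1.0" alphabet="ipa" xml:lang="en" xmlns="{PLS_NS}">
   <lexeme>
      <grapheme>EPUB</grapheme>
      <phoneme>ipʌb</phoneme>
   </lexeme>
</lexicon>
""".strip()


def load_lexicon(source):
    """Return (language, alphabet, {grapheme: phoneme}) for a PLS document."""
    root = ET.fromstring(source)
    entries = {}
    for lexeme in root.iter(f"{{{PLS_NS}}}lexeme"):
        # Simplification: keep only the first phoneme of the lexeme.
        phoneme = lexeme.findtext(f"{{{PLS_NS}}}phoneme")
        if phoneme is None:
            continue  # no phoneme to apply; skipped in this sketch
        # A lexeme may list several graphemes that share a pronunciation.
        for grapheme in lexeme.findall(f"{{{PLS_NS}}}grapheme"):
            entries[grapheme.text] = phoneme
    return root.get(XML_LANG), root.get("alphabet"), entries


language, alphabet, entries = load_lexicon(PLS_SOURCE)
print(language, alphabet, entries)
```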

3.3 Associating with EPUB Content Documents

EPUB Creators MAY associate zero or more pronunciation lexicons [PRONUNCIATION-LEXICON] with an EPUB Content Document.

To associate a pronunciation lexicon with an XHTML Content Document, EPUB Creators MUST use the [HTML] link element. Similarly, to associate a pronunciation lexicon with an SVG Content Document, EPUB Creators MUST use the [SVG2] link element.

For both types of EPUB Content Document, the link element MUST have its rel attribute set to "pronunciation" and its type attribute set to the media type "application/pls+xml".

EPUB Creators SHOULD specify the link element hreflang attribute on each link, and its value MUST match the language for which the pronunciation lexicon is relevant [PRONUNCIATION-LEXICON] when specified.

Example 6

The following example shows two pronunciation lexicons (one for Mandarin and one for Mongolian) associated with an XHTML Content Document.

<html … >    
    <head>
        …
        <link rel="pronunciation" type="application/pls+xml" hreflang="cmn" href="../speech/cmn.pls"/>
        <link rel="pronunciation" type="application/pls+xml" hreflang="mn" href="../speech/mn.pls"/>
    </head>        
    …
</html>
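
A processor might discover the associated lexicons along the following lines. This Python sketch handles XHTML Content Documents only; the helper name `lexicon_links`, the sample title, and the stylesheet link are assumptions added for the illustration.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"
PLS_MEDIA_TYPE = "application/pls+xml"


def lexicon_links(xhtml_source):
    """Return (hreflang, href) pairs for links that reference PLS lexicons."""
    root = ET.fromstring(xhtml_source)
    found = []
    for link in root.iter(f"{{{XHTML_NS}}}link"):
        # rel is a space-separated token list, so test membership.
        rel_tokens = link.get("rel", "").lower().split()
        if "pronunciation" in rel_tokens and link.get("type") == PLS_MEDIA_TYPE:
            found.append((link.get("hreflang"), link.get("href")))
    return found


XHTML = f"""<html xmlns="{XHTML_NS}">
  <head>
    <title>Sample</title>
    <link rel="stylesheet" type="text/css" href="../css/book.css"/>
    <link rel="pronunciation" type="application/pls+xml" hreflang="cmn" href="../speech/cmn.pls"/>
    <link rel="pronunciation" type="application/pls+xml" hreflang="mn" href="../speech/mn.pls"/>
  </head>
  <body/>
</html>"""

links = lexicon_links(XHTML)
print(links)
```

The href values are returned as written; a real processor would still need to resolve them against the location of the Content Document. A missing hreflang shows up as None in the first position of the pair.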

5. Reading System Support

5.1 Introduction

Reading Systems may implement Text-to-Speech playback in different ways depending on the type of engine they use — one might only feed the text content of the document to the engine, for example, while another could support full markup. This document tries to provide flexibility in its requirements to allow for these differences. The only requirement is that the correct rendering behaviour result.

Although this document frames the enhancements in the context of a Reading System with built-in Text-to-Speech rendering capabilities, it is anticipated that any application or assistive technology that can access the markup of an EPUB Publication will be able to use these features to provide improved voice rendering. Ensuring the technologies work with these applications is outside the scope of this work, however.

5.2 Conformance

Reading Systems with Text-to-Speech (TTS) capabilities SHOULD support SSML attributes, pronunciation lexicons and CSS Speech as follows:

SSML

Reading Systems that support SSML:

MUST process the ssml:ph attribute per the requirements for the phoneme element's ph attribute [SSML] with the additional requirements that they:
- MUST ignore ssml:ph attributes whose value is an empty string or consists only of ASCII whitespace [INFRA].
- MUST ignore ssml:ph attributes on elements whose descendant text content is an empty string or consists only of ASCII whitespace [INFRA].
- MUST ignore ssml:ph attributes on elements whose descendant text content represents a fallback.
MUST process the ssml:alphabet attribute per the requirements for the phoneme element's alphabet attribute [SSML].
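
The first two ignore rules could be sketched as follows. This is a hedged illustration only: `effective_ph` is a hypothetical name, the ASCII whitespace set is the one listed in the Infra Standard, and the rule about fallback content is not modelled because detecting a fallback depends on the element type.

```python
import xml.etree.ElementTree as ET

SSML_NS = "http://www.w3.org/2001/10/synthesis"
PH = f"{{{SSML_NS}}}ph"
ASCII_WHITESPACE = "\t\n\f\r "  # TAB, LF, FF, CR, SPACE


def effective_ph(element):
    """Return the ssml:ph value to voice, or None if it is to be ignored."""
    value = element.get(PH)
    if value is None or not value.strip(ASCII_WHITESPACE):
        return None  # absent, empty, or whitespace-only attribute
    if not "".join(element.itertext()).strip(ASCII_WHITESPACE):
        return None  # no descendant text that would normally be rendered
    return value


doc = ET.fromstring(
    f'<p xmlns:ssml="{SSML_NS}">'
    '<span ssml:ph="ipʌb">EPUB</span>'  # kept
    '<span ssml:ph="  ">EPUB</span>'    # ignored: whitespace-only value
    '<span ssml:ph="ipʌb"> </span>'     # ignored: whitespace-only text
    '</p>'
)
decisions = [effective_ph(span) for span in doc]
print(decisions)
```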

Pronunciation Lexicons

Reading Systems that support pronunciation lexicons:

MUST process all linked pronunciation lexicons in an EPUB Content Document as defined in [PRONUNCIATION-LEXICON].
MUST apply the supplied lexemes to all text nodes in the EPUB Content Document whose language matches the language for which the pronunciation lexicon is relevant [PRONUNCIATION-LEXICON]. [BCP47] defines the algorithm for matching language tags.
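
Which matching scheme a Reading System adopts is not spelled out here. As one possibility, the Python sketch below assumes the basic filtering scheme from RFC 4647 (part of BCP 47), treating the lexicon language as the language range and the language of the text node as the tag. The function name is hypothetical and the wildcard range "*" is not handled.

```python
def language_matches(lexicon_lang, text_lang):
    """Basic filtering: the range equals the tag, or is a prefix of it
    that ends on a subtag boundary. Comparison is case-insensitive."""
    language_range = lexicon_lang.lower()
    tag = text_lang.lower()
    return tag == language_range or tag.startswith(language_range + "-")


print(language_matches("en", "en-US"))    # lexicon for "en", text in "en-US"
print(language_matches("en-US", "en"))    # lexicon is more specific than text
print(language_matches("MN", "mn-Cyrl"))  # case differences do not matter
```

Under this assumption a lexicon declared for "en" applies to text tagged "en-US", but a lexicon declared for "en-US" does not apply to text tagged only "en".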

Note

It is not required that the Reading System use a Text-to-Speech engine that supports pronunciation lexicons so long as the lexemes are processed and applied correctly. A Reading System might, for example, transform the lexicon into an alternative dictionary format its TTS engine supports.

SSML and Pronunciation Lexicons

Reading Systems that support SSML and pronunciation lexicons:

MUST let any pronunciation instructions provided via the ssml:ph attribute take precedence in cases where a grapheme element [PRONUNCIATION-LEXICON] matches a text node of an element that carries the ssml:ph attribute.
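
The precedence rule reduces to a simple ordering of lookups. In this minimal Python sketch, `choose_pronunciation` and the sample lexicon entry are hypothetical, and the ssml:ph value is assumed to have already passed the ignore rules listed for SSML above.

```python
def choose_pronunciation(text, ph_value, lexicon):
    """Prefer an inline ssml:ph value over a matching lexicon entry.

    ph_value is the attribute value that applies to the text, or None.
    lexicon maps graphemes to phoneme strings.
    Returns None when neither source supplies a pronunciation, leaving
    the text to the TTS engine's own rules.
    """
    if ph_value:
        return ph_value
    return lexicon.get(text)


lexicon = {"EPUB": "ɛpʌb"}  # illustrative entry only
print(choose_pronunciation("EPUB", "ipʌb", lexicon))  # inline value wins
print(choose_pronunciation("EPUB", None, lexicon))    # falls back to lexicon
print(choose_pronunciation("PDF", None, lexicon))     # nothing supplied
```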

CSS Speech

This document adds no additional requirements for Reading System support to those defined in [CSS-SPEECH-1].

B. Acknowledgements

This section is non-normative.

The following members of the EPUB 3 Working Group contributed to the development of this specification:

Juliette Alexandria (Access2online Inc.)
Luc Audrain (W3C Invited Expert)
Will AWAD (Newgen Knowledgeworks)
Sofia Bautista (Legible Media Inc.)
Laura Brady (W3C Invited Expert)
Leah Brochu (National Network for Equitable Library Service)
Matthew C. Chan (House of Anansi Press)
Yu-Wei Chang (Taiwan Digital Publishing Forum)
Fred Chasen (Scribd)
Garth Conboy (Google LLC)
Juan Corona (Legible Media Inc.)
Dave Cramer (W3C Invited Expert, chair)
Romain Deltour (DAISY Consortium)
Marisa DeMeglio (DAISY Consortium)
Brady Duga (Google LLC)
Reinaldo Ferraz (NIC.br - Brazilian Network Information Center)
John Foliot (W3C Invited Expert)
Teenya Franklin (Pearson plc)
Hadrien Gardeur (EDRLab)
Matt Garrish (DAISY Consortium)
Jen Goulden (Crawford Technologies)
Ivan Herman (W3C, staff contact)
Tetsu Hoshino (Kodansha, Publishers, Ltd.)
Norikazu Ishizu (Kadokawa Corporation)
Norihito IYENAGA (Kodansha, Publishers, Ltd.)
Ken Jones (Circular Software)
Deborah Kaplan (W3C Invited Expert)
Bill Kasdorf (Book Industry Study Group)
George Kerscher (DAISY Consortium)
Kazuhito Kidachi (Mitsue-Links Co., Ltd.)
Masakazu Kitahara (Voyager Japan, Inc.)
Toshiaki Koike (Voyager Japan, Inc.)
Ryo Kuroda (ACCESS CO., LTD.)
Charles LaPierre (Benetech)
Dan Lazin (Google LLC)
Laurent Le Meur (EDRLab)
Farrah Little (National Network for Equitable Library Service)
Karan Malhotra (Newgen Knowledgeworks)
Makoto Murata (DAISY Consortium)
Cristina Mussinelli (Fondazione LIA)
Yoichiro Nagao (Kodansha, Publishers, Ltd.)
Yoshinori Ohmura (SHUEISHA Inc.)
Rachel Osolen (National Network for Equitable Library Service)
Gregorio Pellegrino (Fondazione LIA)
Vijaya Gowri Perumal (Newgen Knowledgeworks)
Wendy Reid (Rakuten, Inc., chair)
Leonard Rosenthol (Adobe)
Shinobu Sato (Kadokawa Corporation)
Ben Schroeter (Pearson plc)
Daihei Shiohama (MEDIA DO Co., Ltd.)
Tzviya Siegman (Wiley)
Avneesh Singh (DAISY Consortium)
MOTOI SUZUKI (SHUEISHA Inc.)
Yutaka Suzuki (Kadokawa Corporation)
Kyrce Swenson (Pearson plc)
Shinya Takami (Kadokawa Corporation, chair)
Mateus Teixeira (W. W. Norton & Company)
Yukio Tomikura (Kodansha, Publishers, Ltd.)
Aimee Ubbink (Crawford Technologies)
Daniel Weck (DAISY Consortium)
Zheng Xu (Gardenia Corp, Rakuten, Inc.)
Fuqiao Xue (W3C)
Evan Yamanishi (W. W. Norton & Company)
Osamu Yoshiba (Kodansha, Publishers, Ltd.)
Junichi Yoshii (Kodansha, Publishers, Ltd.)
Naomi Yoshizawa (W3C)

C. References

C.1 Normative references

[BCP47]: Tags for Identifying Languages. A. Phillips, Ed.; M. Davis, Ed.. IETF. September 2009. Best Current Practice. URL: https://www.rfc-editor.org/rfc/rfc5646
[CSS-SPEECH-1]: CSS Speech Module. Daniel Weck. W3C. 10 March 2020. W3C Candidate Recommendation. URL: https://www.w3.org/TR/css-speech-1/
[CSS2]: CSS 2. W3C. W3C Recommendation. URL: https://www.w3.org/TR/CSS2/
[EPUB-33]: EPUB 3.3. Matt Garrish; Ivan Herman; Dave Cramer; Garth Conboy; Marisa DeMeglio; Daniel Weck. W3C. 9 September 2021. W3C Working Draft. URL: https://www.w3.org/TR/epub-33/
[HTML]: HTML Standard. Anne van Kesteren; Domenic Denicola; Ian Hickson; Philip Jägenstedt; Simon Pieters. WHATWG. Living Standard. URL: https://html.spec.whatwg.org/multipage/
[INFRA]: Infra Standard. Anne van Kesteren; Domenic Denicola. WHATWG. Living Standard. URL: https://infra.spec.whatwg.org/
[PRONUNCIATION-LEXICON]: Pronunciation Lexicon Specification (PLS) Version 1.0. Paolo Baggia. W3C. 14 October 2008. W3C Recommendation. URL: https://www.w3.org/TR/pronunciation-lexicon/
[RFC2119]: Key words for use in RFCs to Indicate Requirement Levels. S. Bradner. IETF. March 1997. Best Current Practice. URL: https://www.rfc-editor.org/rfc/rfc2119
[RFC8174]: Ambiguity of Uppercase vs Lowercase in RFC 2119 Key Words. B. Leiba. IETF. May 2017. Best Current Practice. URL: https://www.rfc-editor.org/rfc/rfc8174
[SSML]: Speech Synthesis Markup Language (SSML) Version 1.1. Daniel Burnett; Zhi Wei Shuang. W3C. 7 September 2010. W3C Recommendation. URL: https://www.w3.org/TR/speech-synthesis11/
[SVG2]: Scalable Vector Graphics (SVG) 2. Amelia Bellamy-Royds; Bogdan Brinza; Chris Lilley; Dirk Schulze; David Storey; Eric Willigers. W3C. 4 October 2018. W3C Candidate Recommendation. URL: https://www.w3.org/TR/SVG2/

C.2 Informative references

[EPUBContentDocs-30]: EPUB Content Documents 3.0. Markus Gylling; William McCoy; Elika J. Etemad; Matt Garrish. IDPF. 11 October 2011. URL: http://idpf.org/epub/30/spec/epub30-contentdocs-20111011.html
[Spoken-HTML]: Specification for Spoken Presentation in HTML. Irfan Ali; Markku Hakkinen; Paul Grenier; Ruoxi Ran. W3C. 18 May 2021. W3C Working Draft. URL: https://www.w3.org/TR/spoken-html/
[WAI-ARIA]: Accessible Rich Internet Applications (WAI-ARIA) 1.0. James Craig; Michael Cooper et al. W3C. 20 March 2014. W3C Recommendation. URL: https://www.w3.org/TR/wai-aria/

EPUB 3 Text-to-Speech Enhancements 1.0

W3C Working Group Note 14 September 2021

Abstract

Status of This Document

1. Introduction

1.1 Overview

1.2 Background

1.3 Terminology

1.4 Conformance

2. SSML Attributes

2.1 Introduction

2.2 The `ssml:ph` attribute

2.3 The `ssml:alphabet` attribute

3. Pronunciation Lexicons

3.1 Introduction

3.2 Lexicon Conformance

3.3 Associating with EPUB Content Documents

4. CSS Speech

5. Reading System Support

5.1 Introduction

5.2 Conformance

A. Change Log

B. Acknowledgements

C. References

C.1 Normative references

C.2 Informative references
