Hypertext-to-Speech and Media Overlays

Hypertext-to-Speech

XHTML and SSML

EPUB 3.2 supports two SSML attributes, ssml:ph and ssml:alphabet, in XHTML content documents [1]. Here is a sketch of XHTML+SSML styled with the CSS3 Speech Module [2]:

<html xmlns:ssml="http://www.w3.org/2001/10/synthesis">
  <body ssml:alphabet="ipa">
    <span id="sentence-1">This is a sentence of <span ssml:ph="tɛkst" style="voice-duration:0.5s">text</span> with markup for hypertext-to-speech.</span>
  </body>
</html>
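To relate the attributes to standalone SSML, here is a small JavaScript sketch that serializes annotated text as the corresponding SSML phoneme element. The helper names escapeXml and toPhonemeElement are hypothetical and exist only for illustration; the IPA string is passed without enclosing slashes, which are notation rather than phonemes. A reading system need not work this way.

```javascript
// Hypothetical helper: escape text for XML content and attribute values.
function escapeXml(value) {
  return String(value)
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;');
}

// Hypothetical helper: build the SSML <phoneme> element that corresponds to
// an XHTML element carrying ssml:ph, with the alphabet taken from the
// nearest ssml:alphabet value.
function toPhonemeElement(text, ph, alphabet) {
  return '<phoneme alphabet="' + escapeXml(alphabet) + '" ph="' + escapeXml(ph) + '">' +
    escapeXml(text) + '</phoneme>';
}

const phoneme = toPhonemeElement('text', 'tɛkst', 'ipa');
console.log(phoneme);
```

The sketch covers only the pronunciation; the voice-duration style has no counterpart in the phoneme element and would have to be expressed separately.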

Here is a sketch of XHTML+SSML+SMIL, in which a smil:dur attribute takes the place of the voice-duration style property:

<html xmlns:ssml="http://www.w3.org/2001/10/synthesis" xmlns:smil="http://www.w3.org/ns/SMIL">
  <body ssml:alphabet="ipa">
    <span id="sentence-1">This is a sentence of <span ssml:ph="tɛkst" smil:dur="0.5s">text</span> with markup for hypertext-to-speech.</span>
  </body>
</html>

XHTML, MathML and SSML

MathML provides an annotation framework: inside a <semantics> element, <annotation> and <annotation-xml> elements attach alternative representations to an expression. SSML could be carried in an <annotation-xml> element.

<html xmlns:ssml="http://www.w3.org/2001/10/synthesis">
  <body>
    <span id="sentence-2">This is a sentence with mathematics <math>...<annotation-xml encoding="application/ssml+xml">...</annotation-xml>...</math> for hypertext-to-speech.</span>
  </body>
</html>

Media Overlays

EPUB 3.2 supports media overlays using SMIL [3]. Each <par> element pairs a text fragment with the audio clip that narrates it.

<smil xmlns="http://www.w3.org/ns/SMIL" version="3.0">
  <body>
    <par id="par1">
      <text src="chapter-1.xhtml#sentence-1"/>
      <audio src="chapter-1_audio.mp3" clipBegin="0s" clipEnd="10s"/>
    </par>
    <par id="par2">
      <text src="chapter-1.xhtml#sentence-2"/>
      <audio src="chapter-1_audio.mp3" clipBegin="10s" clipEnd="20s"/>
    </par>
    <par id="par3">
      <text src="chapter-1.xhtml#sentence-3"/>
      <audio src="chapter-1_audio.mp3" clipBegin="20s" clipEnd="30s"/>
    </par>
  </body>
</smil>
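To illustrate how a reading system might use these pairs, here is a JavaScript sketch that finds the text fragment to highlight at a given playback time. The par elements are modeled as plain objects, and parseSeconds and activeText are hypothetical names; the sketch assumes clock values of the simple form used above (a number followed by "s").

```javascript
// Parse the simple SMIL clock values used above ("10s", "1.5s") into seconds.
// Assumption: only the "<number>s" form is handled in this sketch.
function parseSeconds(clock) {
  const match = /^(\d+(?:\.\d+)?)s$/.exec(clock);
  if (!match) {
    throw new Error('Unsupported clock value: ' + clock);
  }
  return parseFloat(match[1]);
}

// Plain-object stand-ins for the <par> elements of the overlay document.
const pars = [
  { text: 'chapter-1.xhtml#sentence-1', clipBegin: '0s', clipEnd: '10s' },
  { text: 'chapter-1.xhtml#sentence-2', clipBegin: '10s', clipEnd: '20s' },
  { text: 'chapter-1.xhtml#sentence-3', clipBegin: '20s', clipEnd: '30s' }
];

// Return the text reference whose clip contains the given playback time,
// treating each clip as the half-open interval [clipBegin, clipEnd).
function activeText(pars, seconds) {
  const par = pars.find(function (p) {
    return parseSeconds(p.clipBegin) <= seconds && seconds < parseSeconds(p.clipEnd);
  });
  return par ? par.text : null;
}

console.log(activeText(pars, 12.5));
```

Treating clips as half-open intervals means that a time on a boundary, such as 10 seconds, belongs to the later clip only.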

XHTML and SMIL

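It could also be possible to reference an audio file directly from XHTML markup while synchronizing content to it. The smil:src, smil:clipBegin, and smil:clipEnd attributes below are a sketch and are not part of EPUB 3.2.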

<html xmlns:smil="http://www.w3.org/ns/SMIL">
  <body smil:src="chapter-1_audio.mp3">
    <span id="sentence-1" smil:clipBegin="0s" smil:clipEnd="10s">This is a sentence of text with markup for hypertext-to-speech.</span>
  </body>
</html>
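The inline attributes carry the same information as a media overlay document, so one could be derived from the other. Here is a JavaScript sketch of that derivation, with the annotated document modeled as a plain object. The name toPar and the object shape are hypothetical, and attribute values are assumed not to need XML escaping.

```javascript
// Plain-object stand-in for the annotated XHTML sketch above.
// Assumption: the audio source comes from the body element and each span
// carries its own clip times.
const chapter = {
  name: 'chapter-1.xhtml',
  audioSrc: 'chapter-1_audio.mp3',
  spans: [
    { id: 'sentence-1', clipBegin: '0s', clipEnd: '10s' }
  ]
};

// Hypothetical helper: serialize one annotated span as a media overlay <par>.
function toPar(chapter, span, index) {
  return [
    '<par id="par' + (index + 1) + '">',
    '  <text src="' + chapter.name + '#' + span.id + '"/>',
    '  <audio src="' + chapter.audioSrc + '" clipBegin="' + span.clipBegin +
      '" clipEnd="' + span.clipEnd + '"/>',
    '</par>'
  ].join('\n');
}

const overlay = chapter.spans.map(function (span, index) {
  return toPar(chapter, span, index);
}).join('\n');
console.log(overlay);
```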

It is possible to mark up hypertext for hypertext-to-speech and media overlay scenarios simultaneously.

<html xmlns:ssml="http://www.w3.org/2001/10/synthesis" xmlns:smil="http://www.w3.org/ns/SMIL">
  <body ssml:alphabet="ipa" smil:src="chapter-1_audio.mp3">
    <span id="sentence-1" smil:clipBegin="0s" smil:clipEnd="10s">This is a sentence of <span ssml:ph="tɛkst" style="voice-duration:0.5s">text</span> with markup for hypertext-to-speech.</span>
  </body>
</html>

XHTML, MathML and SMIL

This sketch shows how media overlays can indicate audio for hypertext containing mathematics.

<html xmlns:smil="http://www.w3.org/ns/SMIL">
  <body smil:src="chapter-1_audio.mp3">
    <span id="sentence-2" smil:clipBegin="10s" smil:clipEnd="20s">This is a sentence with mathematics <math>...</math> for hypertext-to-speech.</span>
  </body>
</html>

Prosody

Extensible Markup

Elements from an extension namespace can be styled for prosody and prosodic intonation with CSS Speech properties such as voice-stress (see [2]).

<html xmlns:ext="...">
  <head>
    <style type="text/css">
      @namespace ext url(...);
      ext|em { voice-stress: strong; }
    </style>
  </head>
  <body>
    <span id="sentence-1"><ext:em>This</ext:em> is a sentence of text with markup for hypertext-to-speech.</span>
  </body>
</html>

Semantic Inflection

EPUB 3.2 supports semantic inflection through the epub:type attribute [4]. A similar technology is the role attribute [5]. With such attributes adorning document trees in a granular manner, stylesheets could describe prosody or provide prosodic hints (see [2]), resulting in more natural-sounding speech. The following sketches use a hypothetical attribute, semantic, and placeholder property names.

<html>
  <head>
    <style type="text/css">
      [semantic="topic-sentence"] { prosody-hint-a: value; prosody-hint-b: value; }
      [semantic="topic"] { prosody-hint-a: value; prosody-hint-b: value; }
      [semantic="topic-sentence"] [semantic="topic"] { prosody-hint-a: value; prosody-hint-b: value; }
    </style>
  </head>
  <body>
    <span id="sentence-1" semantic="topic-sentence"><span semantic="topic">This</span> is a sentence of text with markup for hypertext-to-speech.</span>
  </body>
</html>
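To show how the three rules above would combine, here is a JavaScript sketch that resolves hints for an element from the semantic values on its ancestor path. It is a simplified model, not a CSS engine: the rule table, the hint values, and the names matches and resolveHints are hypothetical, specificity is approximated by the number of attribute selectors, and inheritance is not modeled.

```javascript
// Hypothetical rule table mirroring the three selectors above. The hint
// values are placeholders, as in the stylesheet sketch.
const rules = [
  { chain: ['topic-sentence'], hints: { 'prosody-hint-a': 'sentence-a' } },
  { chain: ['topic'], hints: { 'prosody-hint-a': 'topic-a', 'prosody-hint-b': 'topic-b' } },
  { chain: ['topic-sentence', 'topic'], hints: { 'prosody-hint-a': 'nested-a' } }
];

// path: semantic values from the outermost ancestor down to the element.
// A rule matches when its last entry equals the element's own value and its
// remaining entries appear, in order, among the ancestors (as with the
// descendant combinator).
function matches(chain, path) {
  if (chain[chain.length - 1] !== path[path.length - 1]) return false;
  let i = 0;
  for (const value of path.slice(0, -1)) {
    if (i < chain.length - 1 && chain[i] === value) i += 1;
  }
  return i === chain.length - 1;
}

// Merge the matching rules so that longer chains (higher specificity) win;
// the stable sort keeps source order for rules of equal length.
function resolveHints(path) {
  return rules
    .filter(function (rule) { return matches(rule.chain, path); })
    .sort(function (a, b) { return a.chain.length - b.chain.length; })
    .reduce(function (acc, rule) { return Object.assign(acc, rule.hints); }, {});
}

console.log(resolveHints(['topic-sentence', 'topic']));
```

For the topic inside the topic sentence, the nested rule overrides prosody-hint-a while prosody-hint-b still comes from the plain topic rule.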

Perhaps some kind of parse trees could provide structure for prosodic intonation.

<html>
  <head>
    <link rel="stylesheet" type="text/css" href="speech.css" />
  </head>
  <body>
    <span id="sentence-1" semantic="S"><span semantic="NP">This</span> <span semantic="VP">is <span semantic="NP"><span semantic="NP">a sentence</span> <span semantic="PP">of <span semantic="NP">text</span></span></span> <span semantic="PP">with <span semantic="NP"><span semantic="NP">markup</span> <span semantic="PP">for <span semantic="NP">hypertext-to-speech</span></span></span></span></span>.</span>
  </body>
</html>

Perhaps extensible markup could be of use for parse trees.

<html xmlns:ext="...">
  <head>
    <link rel="stylesheet" type="text/css" href="speech.css" />
  </head>
  <body>
    <ext:s id="sentence-1" semantic="topic-sentence"><ext:np semantic="topic">This</ext:np> <ext:vp>is <ext:np><ext:np>a sentence</ext:np> <ext:pp>of <ext:np>text</ext:np></ext:pp></ext:np> <ext:pp>with <ext:np><ext:np>markup</ext:np> <ext:pp>for <ext:np>hypertext-to-speech</ext:np></ext:pp></ext:np></ext:pp></ext:vp>.</ext:s>
  </body>
</html>

Perhaps there could be multiple values for a semantic attribute.

<html>
  <head>
    <link rel="stylesheet" type="text/css" href="speech.css" />
  </head>
  <body>
    <span id="sentence-1" semantic="S topic-sentence"><span semantic="NP topic">This</span> <span semantic="VP">is <span semantic="NP"><span semantic="NP">a sentence</span> <span semantic="PP">of <span semantic="NP">text</span></span></span> <span semantic="PP">with <span semantic="NP"><span semantic="NP">markup</span> <span semantic="PP">for <span semantic="NP">hypertext-to-speech</span></span></span></span></span>.</span>
  </body>
</html>
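With space-separated values, a stylesheet could match a single value using the existing CSS selector [semantic~="topic"], which tests for one word in a whitespace-separated list. Here is a JavaScript sketch of the same test; semanticTokens and hasSemantic are hypothetical helper names.

```javascript
// Split a space-separated attribute value into its tokens, in the way
// class names are split.
function semanticTokens(value) {
  return value.trim().split(/\s+/).filter(Boolean);
}

// True when the attribute value contains the token as a whole word,
// mirroring the behavior of the [semantic~="token"] selector.
function hasSemantic(value, token) {
  return semanticTokens(value).includes(token);
}

console.log(hasSemantic('S topic-sentence', 'topic-sentence'));
console.log(hasSemantic('S topic-sentence', 'topic'));
```

Whole-word matching matters here: "topic" must not match an element whose only related value is "topic-sentence".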

Perhaps there could be multiple semantic attributes.

<html>
  <head>
    <link rel="stylesheet" type="text/css" href="speech.css" />
  </head>
  <body>
    <span id="sentence-1" semantic="S" semantic2="topic-sentence"><span semantic="NP" semantic2="topic">This</span> <span semantic="VP">is <span semantic="NP"><span semantic="NP">a sentence</span> <span semantic="PP">of <span semantic="NP">text</span></span></span> <span semantic="PP">with <span semantic="NP"><span semantic="NP">markup</span> <span semantic="PP">for <span semantic="NP">hypertext-to-speech</span></span></span></span></span>.</span>
  </body>
</html>

Semantic Annotation

Perhaps semantic annotation could be of use for styling prosodic intonation. New CSS selectors would be needed to query semantic graphs annotating and interrelating document elements.

<html>
  <head>
    <link rel="stylesheet" type="text/css" href="speech.css" />
    <script type="application/ld+json">
    [{
       "@id": "#lexeme-1-1",
       ...
    }]
    </script>
  </head>
  <body>
    <span id="sentence-1"><span id="lexeme-1-1">This</span> <span id="lexeme-1-2">is</span> <span id="lexeme-1-3">a</span> <span id="lexeme-1-4">sentence</span> <span id="lexeme-1-5">of</span> <span id="lexeme-1-6">text</span> <span id="lexeme-1-7">with</span> <span id="lexeme-1-8">markup</span> <span id="lexeme-1-9">for</span> <span id="lexeme-1-10">hypertext-to-speech</span>.</span>
  </body>
</html>
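Until such selectors exist, a script could join annotations to elements by id. Here is a JavaScript sketch that indexes annotation nodes by "@id" and looks one up for an element id. The graph contents are invented for illustration: "ex:role" and its values are placeholders, not terms from any published vocabulary, and indexById and annotationFor are hypothetical names.

```javascript
// Hypothetical annotation graph; "ex:role" and its values are placeholders
// invented for this sketch.
const graph = [
  { '@id': '#lexeme-1-1', 'ex:role': 'topic' },
  { '@id': '#lexeme-1-4', 'ex:role': 'comment' }
];

// Index the nodes by "@id" so that an element id resolves in one lookup.
function indexById(nodes) {
  const index = new Map();
  for (const node of nodes) {
    index.set(node['@id'], node);
  }
  return index;
}

// Return the annotation node for an element id, or null when there is none.
function annotationFor(index, elementId) {
  return index.get('#' + elementId) || null;
}

const annotations = indexById(graph);
console.log(annotationFor(annotations, 'lexeme-1-1'));
```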

SSML 2.0

CSS4 Speech Module

Web Speech API 2.0

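Here are some ideas for a next version of the Web Speech API [6]. In the current API, speak() accepts only a SpeechSynthesisUtterance; the overloads below, which take strings with an optional media type, DOM nodes, and documents, are proposals: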

speechSynthesis.speak('This is a sentence of text.');

speechSynthesis.speak('This is a sentence of text.', 'text/plain');

speechSynthesis.speak('<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en-US"><body><p>This is a sentence of text.</p></body></html>', 'application/xhtml+xml');

speechSynthesis.speak('<speak xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US"><p><s>This is a sentence of text.</s></p></speak>', 'application/ssml+xml');

speechSynthesis.speak(document);

speechSynthesis.speak(document.getElementById('sentence-1'));

var fragment = document.createDocumentFragment();
...
speechSynthesis.speak(fragment);

var doc = document.implementation.createDocument('http://www.w3.org/1999/xhtml', 'html', null);
...
speechSynthesis.speak(doc);

var doc = document.implementation.createDocument('http://www.w3.org/2001/10/synthesis', 'speak', null);
...
speechSynthesis.speak(doc);
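
Until such overloads exist, the string forms can be roughly approximated with the current API. Here is a sketch under stated assumptions: toPlainText and speakMarkup are hypothetical names, the tag stripping is a crude regular-expression fallback that discards all SSML and CSS Speech information, and the guard lets the code run where speechSynthesis is absent.

```javascript
// Hypothetical helper: reduce a markup string to plain text by media type.
// Assumption: a regular expression is enough for the well-formed one-line
// samples above; real content would need an XML parser and entity handling.
function toPlainText(input, mediaType) {
  if (mediaType === undefined || mediaType === 'text/plain') {
    return input;
  }
  if (mediaType === 'application/xhtml+xml' || mediaType === 'application/ssml+xml') {
    return input.replace(/<[^>]*>/g, ' ').replace(/\s+/g, ' ').trim();
  }
  throw new TypeError('Unsupported media type: ' + mediaType);
}

// Hypothetical helper: speak through the current Web Speech API when it is
// available, and return the text that was (or would have been) spoken.
function speakMarkup(input, mediaType) {
  const text = toPlainText(input, mediaType);
  if (typeof speechSynthesis !== 'undefined' &&
      typeof SpeechSynthesisUtterance !== 'undefined') {
    speechSynthesis.speak(new SpeechSynthesisUtterance(text));
  }
  return text;
}

const spoken = speakMarkup(
  '<speak xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">' +
    '<p><s>This is a sentence of text.</s></p></speak>',
  'application/ssml+xml'
);
console.log(spoken);
```

The loss of pronunciation and prosody markup in this fallback is the gap that media-type-aware and node-aware speak() overloads would close.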