Hypertext-to-Speech and Media Overlays
Hypertext-to-Speech
XHTML and SSML
EPUB 3.2 supports two SSML attributes, ssml:ph and ssml:alphabet, atop XHTML [1]. Here is a sketch of XHTML+SSML with the CSS3 Speech Module [2]:
<html xmlns:ssml="http://www.w3.org/2001/10/synthesis">
<body ssml:alphabet="ipa">
<span id="sentence-1">This is a sentence of <span ssml:ph="tɛkst" style="voice-duration:0.5s">text</span> with markup for hypertext-to-speech.</span>
</body>
</html>
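For comparison, here is a sketch in JavaScript of how the two attributes might map onto SSML's phoneme element, whose alphabet and ph attributes they mirror. The helper names escapeXml and toPhonemeElement are hypothetical and are not part of EPUB, SSML or any library.

```javascript
// Hypothetical helper: escapes the characters that are unsafe in XML
// text and in double-quoted attribute values.
function escapeXml(value) {
  return String(value)
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;');
}

// Hypothetical helper: builds an SSML <phoneme> element from the values
// an author would write in ssml:alphabet and ssml:ph on the XHTML side.
function toPhonemeElement(text, ph, alphabet) {
  return '<phoneme alphabet="' + escapeXml(alphabet) +
    '" ph="' + escapeXml(ph) + '">' + escapeXml(text) + '</phoneme>';
}

console.log(toPhonemeElement('text', 'tɛkst', 'ipa'));
```

A speech-enabled reading system would presumably perform a mapping of this kind internally before handing content to a synthesizer.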
Here is a sketch of XHTML+SSML+SMIL.
<html xmlns:ssml="http://www.w3.org/2001/10/synthesis" xmlns:smil="http://www.w3.org/ns/SMIL">
<body ssml:alphabet="ipa">
<span id="sentence-1">This is a sentence of <span ssml:ph="tɛkst" smil:dur="0.5s">text</span> with markup for hypertext-to-speech.</span>
</body>
</html>
XHTML, MathML and SSML
MathML includes an annotation framework with <annotation> and <annotation-xml> elements. SSML can be included in MathML annotations.
<html xmlns:ssml="http://www.w3.org/2001/10/synthesis">
<body>
<span id="sentence-2">This is a sentence with mathematics <math>...<annotation-xml encoding="application/ssml+xml">...</annotation-xml>...</math> for hypertext-to-speech.</span>
</body>
</html>
Media Overlays
EPUB 3.2 supports media overlays, which use SMIL to synchronize prerecorded audio with the text of content documents [3].
<smil xmlns="http://www.w3.org/ns/SMIL" version="3.0">
<body>
<par id="par1">
<text src="chapter-1.xhtml#sentence-1"/>
<audio src="chapter-1_audio.mp3" clipBegin="0s" clipEnd="10s"/>
</par>
<par id="par2">
<text src="chapter-1.xhtml#sentence-2"/>
<audio src="chapter-1_audio.mp3" clipBegin="10s" clipEnd="20s"/>
</par>
<par id="par3">
<text src="chapter-1.xhtml#sentence-3"/>
<audio src="chapter-1_audio.mp3" clipBegin="20s" clipEnd="30s"/>
</par>
</body>
</smil>
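Here is a sketch of how the par elements above might be generated. It assumes consecutive clips in a single audio file, so that each clip begins where the previous one ends, and it assumes that ids and file names need no XML escaping. The function buildPars and its parameters are hypothetical.

```javascript
// Hypothetical helper: builds the <par> elements of a media overlay from
// a list of fragment ids and the time (in seconds) at which each clip ends.
function buildPars(textDoc, audioFile, ids, clipEnds) {
  let begin = 0; // the first clip starts at the beginning of the audio file
  return ids.map(function (id, i) {
    const end = clipEnds[i];
    const par =
      '<par id="par' + (i + 1) + '">' +
      '<text src="' + textDoc + '#' + id + '"/>' +
      '<audio src="' + audioFile + '" clipBegin="' + begin + 's" clipEnd="' + end + 's"/>' +
      '</par>';
    begin = end; // the next clip starts where this one ends
    return par;
  });
}

const pars = buildPars(
  'chapter-1.xhtml',
  'chapter-1_audio.mp3',
  ['sentence-1', 'sentence-2', 'sentence-3'],
  [10, 20, 30]
);
console.log(pars.join('\n'));
```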
XHTML and SMIL
It is also possible to reference an audio file from XHTML markup while synchronizing content to it.
<html xmlns:smil="http://www.w3.org/ns/SMIL">
<body smil:src="chapter-1_audio.mp3">
<span id="sentence-1" smil:clipBegin="0s" smil:clipEnd="10s">This is a sentence of text with markup for hypertext-to-speech.</span>
</body>
</html>
It is possible to mark up hypertext simultaneously for hypertext-to-speech and media overlay scenarios.
<html xmlns:ssml="http://www.w3.org/2001/10/synthesis" xmlns:smil="http://www.w3.org/ns/SMIL">
<body ssml:alphabet="ipa" smil:src="chapter-1_audio.mp3">
<span id="sentence-1" smil:clipBegin="0s" smil:clipEnd="10s">This is a sentence of <span ssml:ph="tɛkst" style="voice-duration:0.5s">text</span> with markup for hypertext-to-speech.</span>
</body>
</html>
XHTML, MathML and SMIL
This sketch shows how media overlays can indicate audio for hypertext containing mathematics.
<html xmlns:smil="http://www.w3.org/ns/SMIL">
<body smil:src="chapter-1_audio.mp3">
<span id="sentence-2" smil:clipBegin="10s" smil:clipEnd="20s">This is a sentence with mathematics <math>...</math> for hypertext-to-speech.</span>
</body>
</html>
Prosody
Extensible Markup
Extensible markup can be utilized to style prosody and prosodic intonation (see [2]).
<html xmlns:ext="...">
<head>
<style type="text/css">
@namespace ext url(...);
ext|em { voice-stress: strong; }
</style>
</head>
<body>
<span id="sentence-1"><ext:em>This</ext:em> is a sentence of text with markup for hypertext-to-speech.</span>
</body>
</html>
Semantic Inflection
EPUB 3.2 supports semantic inflection [4]. A similar technology is the role attribute [5]. With suitable attributes adorning document trees in a granular manner, some prosodic intonation or prosodic hints could be styled (see [2]). Stylesheets could describe prosody or provide prosodic hints, resulting in more natural-sounding speech. In the following sketches, an attribute, semantic, is utilized.
<html>
<head>
<style type="text/css">
[semantic="topic-sentence"] { prosody-hint-a: value; prosody-hint-b: value; }
[semantic="topic"] { prosody-hint-a: value; prosody-hint-b: value; }
[semantic="topic-sentence"] [semantic="topic"] { prosody-hint-a: value; prosody-hint-b: value; }
</style>
</head>
<body>
<span id="sentence-1" semantic="topic-sentence"><span semantic="topic">This</span> is a sentence of text with markup for hypertext-to-speech.</span>
</body>
</html>
Perhaps some kind of parse trees could provide structure for prosodic intonation.
<html>
<head>
<link rel="stylesheet" type="text/css" href="speech.css" />
</head>
<body>
<span id="sentence-1" semantic="S"><span semantic="NP">This</span> <span semantic="VP">is <span semantic="NP"><span semantic="NP">a sentence</span> <span semantic="PP">of <span semantic="NP">text</span></span></span> <span semantic="PP">with <span semantic="NP"><span semantic="NP">markup</span> <span semantic="PP">for <span semantic="NP">hypertext-to-speech</span></span></span></span></span>.</span>
</body>
</html>
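Here is a sketch of how such markup might be produced from a parse tree. It assumes a tree written as nested arrays of the form [label, child, ...], where each child is either a string of words or another node, and it assumes that labels and words contain no markup characters. The function toSpans is hypothetical.

```javascript
// Hypothetical helper: turns a parse tree into nested spans whose
// semantic attribute carries the constituent label.
function toSpans(node) {
  if (typeof node === 'string') {
    return node; // a leaf: plain words, emitted as-is
  }
  const [label, ...children] = node;
  // Children are separated by single spaces, as in the markup above.
  return '<span semantic="' + label + '">' +
    children.map(toSpans).join(' ') + '</span>';
}

const tree =
  ['S',
    ['NP', 'This'],
    ['VP', 'is',
      ['NP',
        ['NP', 'a sentence'],
        ['PP', 'of', ['NP', 'text']]]]];

console.log(toSpans(tree));
```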
Perhaps extensible markup could be of use for parse trees.
<html xmlns:ext="...">
<head>
<link rel="stylesheet" type="text/css" href="speech.css" />
</head>
<body>
<ext:s id="sentence-1" semantic="topic-sentence"><ext:np semantic="topic">This</ext:np> <ext:vp>is <ext:np><ext:np>a sentence</ext:np> <ext:pp>of <ext:np>text</ext:np></ext:pp></ext:np> <ext:pp>with <ext:np><ext:np>markup</ext:np> <ext:pp>for <ext:np>hypertext-to-speech</ext:np></ext:pp></ext:np></ext:pp></ext:vp>.</ext:s>
</body>
</html>
Perhaps there could be multiple values for a semantic attribute.
<html>
<head>
<link rel="stylesheet" type="text/css" href="speech.css" />
</head>
<body>
<span id="sentence-1" semantic="S topic-sentence"><span semantic="NP topic">This</span> <span semantic="VP">is <span semantic="NP"><span semantic="NP">a sentence</span> <span semantic="PP">of <span semantic="NP">text</span></span></span> <span semantic="PP">with <span semantic="NP"><span semantic="NP">markup</span> <span semantic="PP">for <span semantic="NP">hypertext-to-speech</span></span></span></span></span>.</span>
</body>
</html>
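With multiple values, a stylesheet could presumably select on an individual value using the existing CSS [semantic~="topic"] form of attribute selector, which matches one word in a whitespace-separated list. Here is a sketch of that matching rule in JavaScript; the function name hasSemanticToken is hypothetical.

```javascript
// Hypothetical helper mirroring the matching rule of the CSS
// [attr~="token"] selector: the attribute value is treated as a
// whitespace-separated list of words, one of which must equal the token.
function hasSemanticToken(attributeValue, token) {
  // In CSS, ~= never matches an empty token or one containing whitespace.
  if (token === '' || /[ \t\n\r\f]/.test(token)) {
    return false;
  }
  return attributeValue.split(/[ \t\n\r\f]+/).includes(token);
}

console.log(hasSemanticToken('NP topic', 'topic'));
console.log(hasSemanticToken('S topic-sentence', 'topic'));
```

The second call does not match, because topic-sentence is a different word from topic; this is what would let the two values be styled independently.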
Perhaps there could be multiple semantic attributes.
<html>
<head>
<link rel="stylesheet" type="text/css" href="speech.css" />
</head>
<body>
<span id="sentence-1" semantic="S" semantic2="topic-sentence"><span semantic="NP" semantic2="topic">This</span> <span semantic="VP">is <span semantic="NP"><span semantic="NP">a sentence</span> <span semantic="PP">of <span semantic="NP">text</span></span></span> <span semantic="PP">with <span semantic="NP"><span semantic="NP">markup</span> <span semantic="PP">for <span semantic="NP">hypertext-to-speech</span></span></span></span></span>.</span>
</body>
</html>
Semantic Annotation
Perhaps semantic annotation could be of use for styling prosodic intonation. New CSS selectors would be needed to query semantic graphs annotating and interrelating document elements.
<html>
<head>
<link rel="stylesheet" type="text/css" href="speech.css" />
<script type="application/ld+json">
[{
"@id": "#lexeme-1-1",
...
}]
</script>
</head>
<body>
<span id="sentence-1"><span id="lexeme-1-1">This</span> <span id="lexeme-1-2">is</span> <span id="lexeme-1-3">a</span> <span id="lexeme-1-4">sentence</span> <span id="lexeme-1-5">of</span> <span id="lexeme-1-6">text</span> <span id="lexeme-1-7">with</span> <span id="lexeme-1-8">markup</span> <span id="lexeme-1-9">for</span> <span id="lexeme-1-10">hypertext-to-speech</span>.</span>
</body>
</html>
SSML 2.0
CSS4 Speech Module
Web Speech API 2.0
Here are some ideas with regard to a next version of the Web Speech API [6]:
speechSynthesis.speak('This is a sentence of text.');
speechSynthesis.speak('This is a sentence of text.', 'text/plain');
speechSynthesis.speak('<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en-US"><body><p>This is a sentence of text.</p></body></html>', 'application/xhtml+xml');
speechSynthesis.speak('<speak xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US"><p><s>This is a sentence of text.</s></p></speak>', 'application/ssml+xml');
speechSynthesis.speak(document);
speechSynthesis.speak(document.getElementById('sentence-1'));
var fragment = document.createDocumentFragment();
...
speechSynthesis.speak(fragment);
var doc = document.implementation.createDocument('http://www.w3.org/1999/xhtml', 'html', null);
...
speechSynthesis.speak(doc);
var doc = document.implementation.createDocument('http://www.w3.org/2001/10/synthesis', 'speak', null);
...
speechSynthesis.speak(doc);
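Until something like the above exists, here is a sketch of how the two-argument form might be approximated atop the current API [6], which speaks SpeechSynthesisUtterance objects constructed from plain text. The functions toPlainText and speakContent are hypothetical, and stripping tags with a regular expression is a crude assumption that discards all markup, including any SSML hints.

```javascript
// Hypothetical helper: reduces content of a given media type to plain text.
function toPlainText(content, contentType) {
  if (contentType === undefined || contentType === 'text/plain') {
    return content;
  }
  // Crude assumption: treat every other type as markup, drop all tags
  // and collapse the remaining whitespace.
  return content.replace(/<[^>]*>/g, ' ').replace(/\s+/g, ' ').trim();
}

// Hypothetical shim for speak(content, contentType) using today's API.
function speakContent(content, contentType) {
  const text = toPlainText(content, contentType);
  // speechSynthesis and SpeechSynthesisUtterance exist only in browsers;
  // elsewhere the function just returns the text it would have spoken.
  if (typeof speechSynthesis !== 'undefined' &&
      typeof SpeechSynthesisUtterance !== 'undefined') {
    speechSynthesis.speak(new SpeechSynthesisUtterance(text));
  }
  return text;
}

speakContent('This is a sentence of text.');
speakContent(
  '<speak xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">' +
  '<p><s>This is a sentence of text.</s></p></speak>',
  'application/ssml+xml'
);
```

A shim of this kind loses exactly the information that the ideas above are meant to preserve, which is perhaps an argument for accepting markup natively.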
References
[1] https://w3c.github.io/publ-epub-revision/epub32/spec/epub-contentdocs.html#sec-xhtml-ssml-attrib
[2] https://drafts.csswg.org/css-speech-1/
[3] https://w3c.github.io/publ-epub-revision/epub32/spec/epub-mediaoverlays.html
[4] https://w3c.github.io/publ-epub-revision/epub32/spec/epub-contentdocs.html#sec-xhtml-semantic-inflection
[5] https://www.w3.org/TR/role-attribute/
[6] https://w3c.github.io/speech-api/speechapi.html