Speech Synthesis Markup Language Specification for the Speech Interface Framework

W3C Working Draft 08 August 2000

This version:: http://www.w3.org/TR/2000/WD-speech-synthesis-20000808
Latest version:: http://www.w3.org/TR/speech-synthesis
Editors:: Mark R. Walker, Intel; Andrew Hunt, SpeechWorks International

Abstract

The W3C Voice Browser working group aims to develop specifications to enable access to the Web using spoken interaction. This document is part of a set of specifications for voice browsers, and provides details of an XML markup language for controlling speech synthesisers.

This document describes a XML markup language for generating synthetic speech via a speech synthesiser. Such synthesisers embody rich knowledge about how to render text, and the role of the markup language is to give authors a standard way to control aspects such as volume, pitch, rate and other properties.

Status of this Document

This document is a W3C Working Draft for review by W3C members and other interested parties. It is a draft document and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use W3C Working Drafts as reference material or to cite them as other than "work in progress". A list of current public W3C Working Drafts can be found at http://www.w3.org/TR.

This specification describes markup for generating synthetic speech via a speech synthesiser, and forms part of the proposals for the W3C Speech Interface Framework. This document has been produced as part of the W3C Voice Browser Activity, following the procedures set out for the W3C Process. The authors of this document are members of the Voice Browser Working Group (W3C Members only). This document is for public review, and comments and discussion are welcomed on the public mailing list <www-voice@w3.org>. To subscribe, send an email to <www-voice-request@w3.org> with the word subscribe in the subject line (include the word unsubscribe if you want to unsubscribe). The archive for the list is acccessible online.

1. Introduction
2. Elements and Attributes
- Document Structure, Text Processing and Pronunciation
- 2.1 "speak" Root Element
- 2.2 "xml:lang" Language Attribute
- 2.3 "paragraph" and "sentence"
- 2.4 "sayas" Element
- 2.5 "phoneme" Element
- Prosody and Style
- 2.6 "voice" Element
- 2.7 "emphasis" Element
- 2.8 "break" Element
- 2.9 "prosody" Element
- Other Elements
- 2.10 "audio" Element
- 2.11 "mark" Element
- 2.12 Miscellaneous relevant XML features
3. Future Study
4. Examples
5. DTD for the Speech Synthesis Markup Language
6. Further Reading
7. Acknowledegements

1. Introduction

This document is a specification for a Speech Synthesis Markup Language. This markup language is intended for use by systems that need to produce computer-generated speech output such as Voice Browsers, web browsers and accessible applications. The language provides a set of elements that are focussed on the specific challenges of automatically producing natural-sounding, understandable speech output.

The W3C Standard is known as the Speech Recognition Grammar Specification and is based upon the JSML specification, which is owned by Sun Microsystems, Inc., California, U.S.A.

Sun, Sun Microsystems, Inc., the Sun logo, Java and all Java-based marks and logos are trademarks or registered trademarks of Sun Microsystems, Inc. in the United States and other countries. ©Sun Microsystems.

1.1 Terminology and Design Concepts

There is some variance in the use of terminology in the speech synthesis community. The following definitions establish a common understanding for this document.

Voice Browser	A device which interprets a (voice) markup language and is capable of generating voice output and/or interpreting voice input, and possibly other input/output modalities.
Speech Synthesis	The process of automatic generation of speech output from data input which may include plain text, formatted text or binary objects.
Text-To-Speech	The process of automatic generation of speech output from text or annotated text input.

The design and standardization process has followed from the Speech Synthesis Markup Requirements for Voice Markup Languages published December 23, 1999 by the W3C Voice Browser Working Group.

The following items were the key design criteria.

Consistency: provide predictable control of voice output across platforms and across speech synthesis implementations.
Interoperability: support use along with other W3C specifications including (but not limited to) the Dialog Markup Language, Audio Cascading Style Sheets and SMIL.
Generality: support speech output for a wide range of applications with varied speech content.
Internationalization: Enable speech output in a large number of languages within or across documents.
Generation and Readability: Support automatic generation and hand authoring of documents. The documents should be human-readable.
Implementable: The specification should be implementable with existing, generally available technology and the number of optional features should be minimal.

1.2 Speech Synthesis Processes

A Text-To-Speech (TTS) system that supports the Speech Synthesis Markup Language will be responsible for rendering a document as spoken output and for using the information contained in the markup to render the document as intended by the author.

Document creation: A text document provided as input to the TTS system may be produced automatically, by human-authoring or through a combination of these forms. The Speech Synthesis markup language defines the form of the document.

Document processing: The following are the six major processing steps undertaken by a TTS system to convert marked-up text input into automatically-generated voice output. The markup language is designed to be sufficiently rich so as to allow control over each of the steps described below so that the document author (human or machine) can control the final voice output.

XML Parse: An XML parser is used to extract the document tree and content from the incoming text document. The structure, tags and attributes obtained in this step influence each of the following steps.
Structure analysis: The structure of a document influences the way in which a document should be read. For example, there are common speaking patterns associated with paragraphs and sentences.

- Markup support: The "paragraph" and "sentence" elements defined in the TTS markup language explicitly indicate document structures that affect the speech output.

- Non-markup behavior: In documents and parts of documents where these elements are not used, the TTS system is responsible for inferring the structure by automated analysis of the text, often using punctuation and other language-specific data.
Text normalization: All written languages have special constructs that require a conversion of the written form (orthographic form) into the spoken form. Text normalization is an automated process of the TTS system that performs this conversion. For example, for English, when "$200" appears in a document it may be spoken as "two hundred dollars". Similarly, "1/2" may be spoken as "half", "January second", "February first", "one of two" and so on.

- Markup support: The "sayas" element can be used in the input document to explicitly indicate the presence and type of these constructs and to resolve ambiguities. The set of constructs that can be marked includes dates, times, numbers, acronyms, current amounts and more. The set covers many of the common constructs that require special treatment across a wide number of languages but is not and cannot be a complete set.

- Non-markup behavior: For text content that is not marked with the "sayas" element the TTS system is expected to make a reasonable effort to automatically locate and convert these constructs to a speakable form. Because of inherent ambiguities (such as the "1/2" example above) and because of the wide range of possible constructs in any language, this process may introduce errors in the speech output and may cause different systems to render the same document differently.
Text-to-phoneme conversion: Once the system has determined the set of words to be spoken it must convert those words to a string of phonemes. A phoneme is the basic unit of sound in a language. Each language (and sometimes each national or dialect variant of a language) has a specific phoneme set: e.g. most US English dialects have around 45 phonemes. In many languages this conversion is ambiguous since the same written word may have many spoken forms. For example, in English, "read" may be spoken as "reed" (I will read the book) or "red" (I have read the book). Another issue is the handling of words with non-standard spellings or pronunciations. For example, an English TTS system will often have trouble determining how to speak some non-English-origin names; e.g. "Tlalpachicatl" which has a Mexican/Aztec origin.

- Markup support: The "phoneme" element allows a phonemic sequence to be provided for any word or word sequence. This provides the content creator with explicit control over pronunciations. The "sayas" element may also be used to indicate that text is a proper name that may allow a TTS system to apply special rules to determine a pronunciation.

- Non-markup behavior: In the absence of a "phoneme" element the TTS system must apply automated capabilities to determine pronunciations. This is typically achieved by looking up words in a pronunciation dictionary and applying rules to determine other pronunciations. Most TTS systems are expert at performing text-to-phoneme conversions so most words of most documents can be handled automatically.
Prosody analysis: Prosody is the set of features of speech output that includes the pitch (also called intonation or melody), the timing (or rhythm), the pausing, the speaking rate, the emphasis on words and many other features. Producing human-like prosody is important for making speech sound natural and for correctly conveying the meaning of spoken language.
- Markup support: The "emphasis" element, "break" element and "prosody" element may all be used by document creators to guide the TTS system is generating appropriate prosodic features in the speech output. The "lowlevel" element (under Future Study) could provide particularly precise control of the prosodic analysis.

- Non-markup behavior: In the absence of these elements, TTS systems are expert (but not perfect) in automatically generating suitable prosody. This is achieved through analysis of the document structure, sentence syntax, and other information that can be inferred from the text input.
Waveform production: The phonemes and prosodic information are used by the TTS system in the production of the audio waveform. There are many approaches to this processing step so there may be considerable platform-specific variation.

- Markup support: The TTS markup does not provide explicit controls over the generation of waveforms. The "voice" element allows the document creator to request a particular voice or specific voice qualities (e.g. a young male voice). The "audio" element allows for insertion of recorded audio data into the output stream.

1.3 Document Generation, Applications and Contexts

There are many classes of document creator that will produce marked-up documents to be spoken by a TTS system. Not all document creators (including human and machine) have access to information that can be used in all of the elements or in each of the processing steps described in the previous section. The following are some of the common cases.

The document creator has no access to information to mark up the text. All processing steps in the TTS system must be performed fully automatically on raw text. The document requires only the containing "speak" element to indicate the content is to be spoken.
When marked text is generated programmatically the creator may have specific knowledge of the structure and/or special text constructs in some or all of the document. For example, an email reader can mark the location of the time and date of receipt of email. Such applications may use elements that affect structure, text normalization, prosody and possibly text-to-phoneme conversion.
Some document creators make considerable effort to mark as many details of the document to ensure consistent speech quality across platforms and to more precisely specify output qualities. In these cases, the markup may use any or all of the available elements to tightly control the speech output. For example, prompts generated in telephony and voice browser applications may be fine-tuned to maximize the effectiveness of the overall system.
The most advanced document creators may skip the higher-level markup (structure, text normalization, text-to-phoneme conversion, and prosody analysis) and produce low-level TTS markup for segments of documents or for entire documents (this capability is being considered for Future Study and is not formally part of this specification). This typically requires tools to generate sequences of phonemes, plus pitch and timing information. For instance, tools that do "copy synthesis" or "prosody transplant" try to emulate human speech by copying properties from recordings.

The following are important instances of architectures or designs from which marked-up TTS documents will be generated. The language design is intended to facilitate each of these approaches.

Dialog language: It is a requirement that it should be possible to include documents marked with the speech synthesis markup language into the dialog description document to be produced by the Voice Browser Working Group.
Interoperability with Aural CSS: The speech synthesis markup language is a final form representation that can be produced when XSLT is applied to XHTML with ACSS. ACSS is covered in Section 19 of the Cascading Style Sheets, level 2 CSS2 Specification (12-May-1998). This usage of speech synthesis facilitates improved accessibility to existing HTML and XHTML content.
Application-specific style-sheet processing: As mentioned above, there are classes of application that have knowledge of text content to be spoken and this can be incorporated into the speech synthesis markup to enhance rendering of the document. In many cases, it is expected that the application will use style-sheets to perform transformations of existing XML documents to speech synthesis markup. This is equivalent to the use of ACSS with HTML and once again the speech synthesis markup language is the "final form" representation to be passed to the speech synthesis engine.

2. Elements and Attributes

The following elements are defined in this draft specification.

2.1 "speak" Root Element

The Speech Synthesis Markup Language is an XML application. The root element is "speak".

     <?xml version="1.0"?>
     <speak>
     ... the body ...
     </speak>

Relevant requirements: 1.2, 2.1, 2.3

2.2 "xml:lang" Attribute: Language

Following the XML convention, languages are indicated by an "xml:lang"attribute on the enclosing element with the value following RFC 1766 to define language codes. Language information is inherited down the document hierarchy, i.e. it has to be given only once if the whole document is in one language, and language information nests, i.e. inner attributes overwrite outer attributes.

<speak xml:lang="en-US">
  <para>I don't speak Japanese.</para>
  <para xml:lang="ja">Nihongo-ga wakarimasen.</para>
</speak>

Usage note 1: The speech output platform determines behavior in the case that a document requires speech output in a language not supported by the speech output platform.

Usage note 2: There may be variation across platforms in the implementation of "xml:lang" for different markup elements. It is reasonable to expect the speech output platform to support a change of language at the document, paragraph and sentence levels. A document author should beware that intra-sentential language changes may not be supported on all platforms.

Usage note 3: A language change often necessitates a change in the voice. Where the platform does not have the same voice in both the enclosing and enclosed languages it should select a new voice with the inherited voice attributes. Any change in voice will reset the prosodic attributes to the default values for the new voice of the enclosed text. Where the "xml:lang" value is the same as the inherited value there is no need for any changes in the voice or prosody.

Usage note 4: All elements should process their contents specific to the enclosing language. For instance, the phoneme, emphasis and break element should each be rendered in a manner that is appropriate to the current language.

Usage note 5: Unsupported languages could be handledby specifying nothing and relying on platform behavior, issuing an event to the host environment, or by providing substitute text in the ML.

Relevant requirements: 1.4, 3.2, 3.3

2.3 "paragraph" and "sentence": Text Structure Elements

A "paragraph" element represents the paragraph structure in text. A "sentence"element represents the sentence structure in text. A paragraph contains zero or more sentences.

<paragraph>
  <sentence>This is the first sentence of the paragraph.</sentence>
  <sentence>Here's another sentence.</sentence>
</paragraph>

Usage note 1: For brevity, the markup also supports <p> and <s> as exact equivalents of <paragraph> and <sentence>. (Note: XML requires that the opening and closing elements be identical so <p> text </paragraph> is not legal.). Also note that <s> means "strike-out" in HTML 4.0 and earlier, and in XHTML-1.0-Transitional but not in XHTML-1.0-Strict.

Usage note 2: The use of paragraph and sentence elements is optional. Where text occurs without an enclosing paragraph or sentence elements the speech output system should attempt to determine the structure using language-specific knowledge of the format of plain text.

Relevant requirements: 3.1

2.4 "sayas" Element

The "sayas" element indicates the type of text construct contained within the element. This information is for use by the text preprocessor of the speech synthesizer which is responsible for determining the words to be spoken. Defining a comprehensive set of text format types is difficult because of the variety of languages that must be considered and because of the inate flexibility of written languages. The "sayas" element has been specified with a reasonable set of format types. Text substitution may be utilized for unsupported constructs.

The "type" attribute is a required attribute that indicates the contained text construct. The format is a text type optionally followed by a colon and a format. The base set of type values, divided according to broad functionality, is as follows:

Pronunciation Types

acronym: contained text is an acronym. The string of characters in the contained text are pronounced as individual characters.
sub: contained text is substituted for pronunciation with the specified text. This allows a document to contain both a spoken and written form.

<sayas type="acronym"> USA </sayas>
<!-- U. S. A. -->
<sayas sub="World Wide Web Consortium">W3C</sayas>

Numerical Types

number: contained text contains integers, fractions, floating points, Roman numerals or some other textual format that can be interpreted and spoken as a number in the current language. Format values for numbers are: "ordinal", where the contained text should be interpreted as an ordinal. The content may be a digit sequence or some other textual format that can be interpreted and spoken as an ordinal in the current language; and "digits", where the contained text is to be read as a digit sequence, rather than as a number.

Rocky <sayas type="number"> XIII </sayas>
<!-- Rocky thirteen -->
Pope John the <sayas type="number:ordinal"> VI </sayas>
<!-- Pope John the sixth -->
Deliver to <sayas type="number:digits"> 123 </sayas> Brookwood.
<!-- Deliver to one two three Brookwood-->

Time, Date and Measure Types

date: contained text is a date. Format values for dates are: "dmy", "mdy", "ymd", "ym", "my", "md", "y". Where the format is omitted, the speech synthesizer should apply localized rules for processing the date.
time: contained text is a time of day. Format values for times are: "hm", and "hms".
duration: contained text is a temporal duration. Format values for durations are: "hm", "hms", "ms" etc.
currency: contained text is a currency amount. Leading and trailing currency symbols are ignored.
measure: contained text is a measurement.

<sayas type="date:ymd"> 2000/1/20 </sayas>
<!-- January 20th two thousand -->
Proposals are due in <sayas type="date:my"> 5/2001 <sayas/>
<!-- Proposals are due in May two thousand and one -->
<sayas type="currency"> $20.45 </sayas>
<!-- twenty dollars and forty five cents -->

Address, Name, Net Types

name: contained text is a proper name of a person, company etc.
net: contained text is an internet handle. Format values for net are: "email", "url".
address: contained text is a postal address.

<sayas type="net:email"> road.runner@acme.com </sayas>

Usage note 1: The conversion of the various types of text and text markup to spoken forms is language and platform-dependent. For example, <sayas type="date:ymd"> 2000/1/20 <sayas> may be read as "January twentieth two thousand" or as "the twentieth of January two thousand" and so on. The markup examples above are provided for usage illustration purposes only.

Usage note 2: When the TTS system is unable to interpret the contents of a "sayas" element as the indicated type, it may choose to ignore the tag and apply default text processing capabilities.

Usage note 3: The "sayas" element can be only be used when the document creator (human or machine) is aware that a particular text construct exists in a document and when the type is known. Document creators without this knowledge can provide only unmarked (raw) text. Certain document creators have knowledge about both the presence of text constructs and how they want them to be spoken. They can perform direct substitution of text or can use the "phoneme" or "lowlevel" elements.

Usage note 4: It is assumed that pronunciations generated by the use of explicit text markup always take precedence over pronunciations produced by a lexicon.

Relevant requirements: 3.8, 3.9

2.5 "phoneme" Element

The "phoneme" element provides a phonetic pronunciation for the contained text. The phonetic string is provided in the required "ph"attribute. The "phoneme" element may be empty. However, it is recommended that the element contain human-readable text that can be used for non-spoken rendering of the document. For example, the content may be displayed visually for users with hearing impairments.

The phonetic string may use the International Phonetic Alphabet (IPA). The Unicode character set contains the complete IPA character set as symbols U+0250 to U+02AF plus certain Latin and diacritic characters.

<phoneme ph="t&#252;m&#251;to&#28A;"> tomato </phoneme>
<!-- This is an example of IPA using character entities -->

<phoneme ph="tümûto"> tomato </phoneme>
<!-- This example uses the Unicode IPA characters. -->
<!-- Note: this will not display correctly on most browsers -->

Usage note 1: Characters composing many of the IPA phonemes are known to display improperly on most platforms.

Usage note 2: Entity definitions may be used for repeated pronunciations. For example:

<!ENTITY uk_tomato "t&#252;m&#251;to&#28A;">
... you say <phoneme ph="&uk_tomato;"> tomato </phoneme> I say...

Usage note 3: In addition to an exhaustive set of vowel and consonant symbols, IPA supports a syllable delimiter, numerous diacritics, stress symbols, lexical tone symbols, intonational markers and more.

Relevant requirements: 3.4, 3.5 (3.7)

2.6 "voice" Element

The "voice" element is a production element that requests a change in speaking voice. Attributes are:

gender: optional attribute indicating the preferred gender of the voice to speak the contained text. Values are "male", "female", "neutral".
age: optional attribute indicating the preferred age of the voice to speak the contained text. Values are (integer) , "child" , "teenager" , "adult", "elder".
variant: optional attribute indicating a preferred variant of the other voice characteristics to speak the contained text. (e.g. the second or next male child voice). Values are (integer), "different"
name: optional attribute indicating a platform-specific voice name to speak the contained text. The value may be a space-separated list of names ordered from top preference down. Values are (voice-name), "default". The "default" value requests that the synthesizer use its prefered voice, typically the best quality voice for the current language.

<voice gender="female" age="child">Mary had a
little lamb,</voice>

<!-- now request a different female child's voice -->
<voice gender="female" age="child" variant="2">It's fleece
was white as snow.</voice>

<!-- platform-specific voice selection -->
<voice name="Mike">I want to be like Mike.</voice>

Usage note 1: When there is not a voice available that exactly matches the attributes specified in the document, the voice selection algorithm may be platform-specific.

Usage note 2: Voice attributes are inherited down the tree including to within elements that change the language.

<voice gender="female"> 
  Any female voice here.
  <voice age="child"> 
    A female child voice here.
    <paragraph xml:lang="ja"> 
      <!-- A female child voice in Japanese. -->
    </paragraph>
  </voice>
</voice>

Usage note 3: A change in voice resets the prosodic parameters since different voices have different natural pitch and speaking rates. Volume is the only exception. It may be possible to preserve prosodic parameters across a voice change by employing a style sheet. Characteristics specified as "+" or "-" voice attributes with respect to absolute voice attributes would not be preserved.

Usage note 4: The "xml:lang" attribute may be used specially to request usage of a voice with a specific dialect or other varient of the enclosing language.

<voice xml:lang="en-cockney">Try a Cockney voice
(London area).</voice>

<voice xml:lang="en-brooklyn">Try one New York
accent.</voice>

Relevant requirements: 4.1

2.7 "emphasis" Element

The "emphasis" element requests that the contained text be spoken with emphasis (also referred to as prominence or stress). The synthesizer determines how to render emphasis since the nature of emphasis differs between languages, dialects or even voices. The attributes are:

level: the "level" attribute indicates the strength of emphasis to be applied. Defined values are "strong", "moderate", "none" and "reduced". The default level is "moderate". The meaning of "strong" and "moderate" emphasis is interpreted according to the language being spoken (languages indicate emphasis using a possible combination of pitch change, timing changes, loudness and other acoustic differences). The "reduced" level is effectively the opposite of emphasising a word. For example, when the phrase "going to" is reduced it may be spoken as "gonna". The "none" level is used to prevent the speech synthesizer from emphasising words that it might typically emphasise.

That is a <emphasis> big </emphasis> car!
That is a <emphasis level="strong"> huge </emphasis>
bank account!

Relevant requirements: 4.2

2.8 "break" Element

The "break" element is an empty element that controls the pausing or other prosodic boundaries between words. The use of the break element between any pair of words is optional. If the element is not defined, the speech synthesizer is expected to automatically determine a break based on the linguistic context. In practice, the "break"element is most often used to override the typical automatic behavior of a speech synthesizer. The attributes are:

size: the "size" attribute is an optional attribute having one of the following relative values: "none", "small", "medium" (default value), or "large". The value "none" indicates that a normal break boundary should be used. The other three values indicate increasingly large break boundaries between words. The larger boundaries are typically accompanied by pauses.
time: the "time" attribute is an optional attribute indicating the duration of a pause in seconds or milliseconds. It follows the "Times" attribute format from the Cascading Style Sheet Specification. e.g. "250ms", "3s".

Take a deep breath <break/>
then continue. 

Press 1 or wait for the tone. <break time="3s"/>
I didn't hear you!

Usage note 1: Using the "size" attribute is generally preferable to the "time" attribute within normal speech. This is because the speech synthesizer will modify the properties of the break according to the speaking rate, voice and possibly other factors. As an example, a fixed 250ms pause (placed with the "time" attribute) sounds much longer in fast speech than in slow speech.

Relevant requirements: none

2.9 "prosody" Element

The "prosody" element permits control of the pitch, speaking rate and volume of the speech output. The attributes are:

pitch: the baseline pitch for the contained text in Hertz, a relative change or values "high", "medium", "low", "default".
contour: sets the actual pitch contour for the contained text. The format is outlined below.
range: the pitch range (variability) for the contained text in Hertz, a relative change or values "high", "medium", "low", "default".
rate: the speaking rate in words per minute for the contained text, a relative change or values "fast", "medium", "slow", "default".
duration: a value in seconds or milliseconds for the desired time to take to read the element contents. Follows the Times attribute format from the Cascading Style Sheet Specification. e.g. "250ms", "3s".
volume: the volume for the contained text in the range 0.0 to 100.0, a relative change or values "silent", "soft", "medium", "loud" or "default".

Relative values

The relative changes for any of the attributes above can be "+10", "-5.5", "+15%", "-8%". For the pitch and range attributes, relative changes in semitones are permitted: "+5st", "-2st". Since speech synthesizers are not able to apply arbitrary prosodic values they may set limits on the values.

The price of XYZ is <prosody rate="-10%">
<sayas type="currency">$45</sayas></prosody>

Pitch contour

The pitch contour is defined as a set of targets at specified intervals in the speech output. The algorithm for interpolating between the targets is platform-specific. In each pair of the form (interval,target), the first value is a percentage of the period of the contained text and the second value is legal value of the "pitch"attribute (absolute, relative, relative semitone, or descriptive values are all permitted). Interval values outside 0% to 100% are ignored. If a value is not defined for 0% or 100% then the nearest pitch target is copied.

<prosody contour="(0%,+20)(10%,+30%)(40%,+10)">
 good morning
</prosody>

Usage note 1: The descriptive values ("high", "medium" etc.) may be specific to the platform, to user preferences or to the current language and voice. As such, it is generally preferable to use the descriptive values or the relative changes over absolute values.

Usage note 2: The default value of all prosodic attributes is no change. For example, omitting the rate attribute means that the rate is the same within the element as outside.

Usage note 3: The "duration" attribute takes precedence over the "rate" attribute. The "contour" attribute takes precedence over the "pitch" and "range" attributes.

Usage note 4: All prosodic attribute values are indicative: if a speech synthesizer is unable to accurately render a document as specified it will make a best effort (e.g. trying to set the pitch to 1Mhz, or the speaking rate to 1,000,000 words per minute.)

Relevant requirements: 4.4, (4.3)

2.10 "audio" Element

The "audio" element supports the insertion of recorded audio files and the insertion of other audio formats in conjunction with synthesized speech output. The audio element may be empty. If the audio element is not empty then the contents should be the marked-up text to be spoken if the audio document is not available. The contents may also be used when rendering the document to non-audible output and for accessibility. The required attribute is "src", which is the URI of a document with an appropriate mime-type.

<!-- Empty element -->
Please say your name after the tone.  <audio src="beep.wav"/>
<!-- Container element with alternative text -->
<audio src="prompt.au">What city do you want to fly from?</audio>

Usage note 1: The "audio" element is not intended to be a complete mechanism for synchronizing synthetic speech output with other audio output or other output media (video etc.). Instead the "audio" element is intended to support the common case of embedding audio files in voice output.

Usage note 2: The alternative text may contain markup. The alternative text may be used when the audio file is not available, when rendering the document as non-audio output, or when the speech synthesizer does not support inclusion of audio files.

Relevant requirements: 3.10

2.11 "mark" Element

A "mark" element is an empty element that places a marker into the output stream for asynchronous notification. When audio output of the TTS document reaches the mark, the speech synthesizer issues an event that includes the required "name" attribute of the element. The platform defines the destination of the event. The "mark" element does not affect the speech output process.

Go from <mark name="here"/> here, to <mark name="there"/> there!

Usage note 1: When supported by the implementation, requests can be made to pause and resume at document locations specified by the mark values.

Usage note 2: The mark names are not required to be unique within a document.

Relevant requirements: 5.2, 5.3

2.12 Miscellaneous relevant XML features

If a non-validating XML parser is used, an arbitrary XML element can be included in documents to expose platform-specific capabilities. If a validating XML parser is used, then engine-specific elements can be included if they are defined in an extended schema within the document. These extension elements are processed by engines that understand them and ignored by other engines.

Usage note 1: When engines support non-standard elements and attributes it is good practice for the name to identify the feature as non-standard, for example, by using a "x" prefix or a company name prefix.

Relevant requirements: 5.4, 5.5.

3. Future Study

3.1 Interoperability with other Markup Languages

Interoperability of the Speech Synthesis Markup Language with other W3C Markup Languages has been considered as part of the process of developing the initial specification. There are, however, a number of areas that require further study before the specification can be finalized.

Interoperability with the Dialog Markup Language planned for development by the Voice Browser Working Group is a high priority. Since the specification development process for the Dialog Markup Language has not yet commenced we are deferring detailed consideration of the issue. Two possible paths for integration have been identified:
- Copy or mimic the Speech Synthesis Markup Language DTD within the Dialog language.
- Allow Dialog documents to contain speech synthesis documents but in a separate XML namespace.
Interoperability with SMIL -- Synchronized Multimedia Integration Language needs to be explored in more detail. The objective should be to permit synthesized speech output to be coordinated with other forms of output. One challenge is that the timing of synthesized speech is not predictable or controllable in the same way as most media forms.
Interoperability with ACSS - Aural Cascading Style Sheets is an objective. The Speech Synthesis Markup Language defines capabilities that are a super-set of ACSS with the following exceptions:
- The "speak-punctuation" style of ACSS is not currently supported. It's function is to request that content text be read normally except that punctuation symbols are read specially. It is not clear yet what the best way is to represent this within the specification.
- ACSS specifies spatial audio capabilities. The Speech Synthesis Markup Language Requirements determined that spatial audio is a "Nice to Have" feature and no decision has been made yet about incorporating it into the specification.

3.2 Other Phoeneme Alpahbets

The Voice Browser Working Group is considering the additional support of two alphabets that define a mapping from IPA to the ASCII character set: Worldbet(Postscript) and X-SAMPA.

Known IPA limitations include:

IPA is difficult to understand even with ASCII equivalents.
There are no conventions for the use of IPA for specific languages and dialects. WorldBet does indicate some common usage.
IPA editors and fonts containing IPA characters are not widely available.

3.3 Audio Element

A future incarnation of the "audio"element could include a "mode"attribute. If equal to "insertion"(the default), the speech output is temporarily paused, the audio is played then speech is resumed. If equal to "background", the audio is played along with speech output. Currently unresolved are the mechanics of how to specify audio playback behaviors like playback termination, etc.

3.4 Mark Element

There has been discussion that the "mark"element should be an XML identifier ("id" attribute) with values being unique within the scope of the document. In addition, future study needs to ensure that events generated by a mark element are consistent with existing event models in other specifications (e.g. DOM, SMIL and the dialog markup language).

3.5 Unspecified Requirements

The following are "Nice to Have" features specified in the Speech Synthesis Markup Language Requirements which are not currently supported in this specification.

Req 4.5: Synchronized facial animation (nice to have)

3.6 Compliance

The Voice Browser Working Group is currently considering Compliance issues. No decision has been made as to whether the specification should address specific implementation requirements or what form they might take.

3.7 "lowlevel" Elements: Fine-Grained Acoustic-Prosodic Control

The "lowlevel" element is a container for a sequence of phoneme and pitch controls: "ph" and "f0" elements respectively. The attributes of the "lowlevel" container element are:

Optional "alt" attribute that provides a human-readable string that is equivalent to the contained phonemic sequence.
Optional "pitch" attribute with values of "absolute", "relative" and "percentage" that indicate how to interpret the values on the contained pitch elements.
Tentative: Optional "alphabet" attribute with a default value of "ipa" and alternative values of "sampa" and "worldbet". This indicates which phonetic alphabet is used for the phonetic string values.

The "ph" and "f0" elements may be interleaved or placed in separate sequences (as in the example below).

"ph" Element: Phoneme with Duration

A "lowlevel" element may contain a sequence of zero or more "ph" elements. The "ph" element is empty. The "p" attribute is required and has a value that is a phoneme symbol from the IPA alphabet. The optional "d" attribute is the duration in seconds or milliseconds (seconds as default) for the phoneme. If the "d" attribute is omitted a platform-specific default is used.

<lowlevel alt="hello">
  <ph p="pau" d=".21"/><ph p="h" d=".0949"/><ph p="&" d=".0581"/>
  <ph p="l" d=".0693"/><ph p="ou" d=".2181"/>
</lowlevel>
<!-- This example uses WorldBet phonemes -->

"f0" Element: Timed Pitch Targets

A "lowlevel" element may contain a sequence of zero or more "f0" elements. The "f0" element is empty. The "v" (value) attribute is required and should be in the form of an integer or simple floating point number (no exponentials). The value attribute is interpreted according to the value of the "pitch" attribute of the enclosing "lowlevel" element. The optional "t" attribute indicates the time offset from the preceding "f0" element and has a value of seconds or milliseconds (seconds as default). If the "t" attribute is omitted on the first "f0" element in a "lowlevel" container, the specified "f0" target value is aligned with the start of the first non-silent phoneme.

<lowlevel alt="hello" pitch="absolute">
  <ph p="pau" d=".21"/><ph p="h" d=".0949"/><ph p="&" d=".0581"/>
  <ph p="l" d=".0693"/><ph p="ou" d=".2181"/>
  <!-- This example uses WorldBet phonemes -->

  <f0 v="103.5"/> <f0 v="112.5" t=".075"/>
  <f0 v="113.2" t=".175"/> <f0="128.1" t=".28"/>
</lowlevel>

Usage note 1: It is anticipated that low-level markup will be generated by automated tools, so compactness is given priority over readability.

Relevant requirements: 3.7

Issues:

There is an unresolved request to require that the "fo" and "ph" elements be interleaved within the "lowlevel" element so that they are in exact temporal order. This change is simple to require but requires us to ensure that the duration attributes be interpreted consistently. It has been proposed that for the "ph" element the "d" attribute be an offset from the prior "ph" element but that for the "f0" element it should be an offset from the previous "ph" or "f0" element. A diagram would help here.
The attribute names for this element set need to be similar, identical, or somehow consistent with those of the "prosody" element.
Would "pi" or "fr" be preferrable to "f0": i.e. pitch or frequency vs. the technical abbreviation for fundamental frequency.
The "phoneme" element and "lowlevel" are inconsistent in that the phone string is an attribute in "phoneme" and part of the content for "lowlevel". Also, the alternative text is the contents of the "phoneme" element but an attribute of "lowlevel". Perhaps these inconsistencies are unavoidable?
This element should track changes in the "phoneme" element. e.g. if "phoneme" adds an "alphabet" attribute that allows the specification of IPA, WorldBet or possibly other phonemic alphabets, then a similar attribute should be added to the "lowlevel" element.

3.8 Intonational Controls

The existing specification supports many ways by which a document author can affect the intonational rendering of speech output. In part, this reflects the broad communicate role of intonation in spoken language: it reflects document structure (see the paragraph and sentence elements), prominence (see the emphasis element), and prosodic boundaries (see the break element). Intonation also reflects emotion and many less definable characteristics that are not planned for inclusion in this specication.

The specification could be enhanced to provide specific intonational controls at boundaries and at points of emphasis. In both cases there are existing elements to which intonational attributes could be added. The issues that need to be addressed are:

Determining the form that the attributes should take,
Ensuring that the attributes are applicable to a wide set of languages,
Ensuring that use of the attributes does not require specialized knowledge of intonation theory.

Intonational boundaries: The existing specification allows a document to mark major boundaries and structures using the paragraph and sentence elements and the break element. The break element explicitly marks a boundary whereas boundaries implicitly occur at both the start and end of paragraphs and sentences. For each of these boundary locations we could specify intonational patterns such as a rise, fall, flat, low-rising, high-falling and some more complex patterns. Proposals received to date include use of labelling systems from intonational theory or use of punctuation symbols such as '?', '!' and '.'.

Emphasis tones: The emphasis elementcan be used to explicitly mark any word or word sequence as emphasized. Each spoken language has patterns by which emphasis is marked intonationally. For example, for English, the more common emphasis tones are high, low, low-rising, and high-downstep. Our challenge is to determine a set of tones that has sufficient coverage of the tones of many spoken languages to be useful, but which does not require extensive theoretical knowledge.

3.9 "value" Element

A "value" element has been proposed that permits substitution of a variable into the text stream. The variable's value must be defined separately, either by a "set" element (not yet defined) earlier in the document or in the host environment (e.g. in a voice browser). The value is a plain text string (markup may be ignored).

name: the name of the variable to be inserted in the text stream.
type: same format as the "type" attribute of the "sayas" element allowing the text to be marked as a phone number, date, time etc.

The time is <value name="currentTime"/>.

Relevant requirements: 3.11

Issues:

The "value" element is equivalent to the "value" element of the VoiceXML specification. Unlike the Voice Browser which interprets VoiceXML, a speech synthesizer does not typically have persistent variables and would not normally have access to the internal variable of a Voice Browser. One proposal is for the Dialog ML to define a "value" element in its namespace and to convert that element to normal speech synthesis markup before passing the document to a speech synthesis. This is consistent with the spirit of the speech synthesis markup language as a "final form" representation. A downside of this approach is that the DTD for speech synthesis in the Dialog ML would be inconsistent with this specification.
VoiceXML permits values to be stored audio files.

4. Examples

The following is an example of reading headers of email messages. The paragraph and sentence elementsare used to mark the text structure. The sayas element is used to indicate text constructs such as the time and proper name. The break element is placed before time and has the effect of marking the time as important information for the listener to pay attention to. The prosody element is used to slow the speaking rate of the email subject so that the user has extra time to listen and write down the details.

<?xml version="1.0"?>
<speak>
<paragraph>

<sentence>You have 4 new messages.</sentence>

<sentence>The first is from <sayas 
type="name">Stephanie Williams</sayas>
and arrived at <break/>
<sayas type="time">3:45pm</sayas>.</sentence>

<sentence>The subject is <prosody
rate="-20%">ski trip</prosody></sentence>

</paragraph>
</speak>

The following example combines audio files and different spoken voices to provide information on a collection of music.

<?xml version="1.0"?>
<speak>

<paragraph><voice gender="male">

<sentence>Today we preview the latest romantic music
 from the W3C.</sentence>

<sentence>Hear what the Software Reviews said about
 Tim Lee's newest hit.</sentence>

</voice></paragraph>

<paragraph><voice gender="female">
He sings about issues that touch us all.
</voice></paragraph>

<paragraph><voice gender="male">
Here's a sample.  <audio src="http://www.w3c.org/music.wav">
Would you like to buy it?</voice></paragraph>

</speak>

5. DTD for the Speech Synthesis Markup Language

<?xml version="1.0" encoding="ISO-8859-1"?>

<!-- Speech Synthesis Markup Language  v0.5 20000504 -->

<!ENTITY % allowed-within-sentence " #PCDATA | sayas | phoneme |
     voice | emphasis | break | prosody | audio | value | mark " >

<!ENTITY % structure "paragraph | p | sentence | s">

<!ENTITY % duration "CDATA">

<!ENTITY % integer "CDATA" >

<!ENTITY % uri "CDATA" >

<!ENTITY % phoneme-string "CDATA" >
<!ENTITY % phoneme-alphabet "CDATA" >

<!-- Definitions of the structural elements. -->
<!-- Currently, these elements support only the xml:lang attribute -->

<!ELEMENT speak (%allowed-within-sentence; | %structure;)*>

<!ELEMENT paragraph (%allowed-within-sentence; | sentence | s)*>

<!ELEMENT sentence (%allowed-within-sentence;)*>

<!-- The flexible container elements can occur within paragraph -->
<!-- and sentence but may also contain these structural elements. -->

<!ENTITY % voice-name "CDATA">

<!ELEMENT voice (%allowed-within-sentence; | %structure;)*>
<!ATTLIST voice
     gender   (male|female|neutral)                  #IMPLIED
     age      (%integer;|child|teenager|adult|elder) #IMPLIED
     variant  (%integer;|different)                  #IMPLIED
     name     (%voice-name;|default)                 #IMPLIED >

<!ELEMENT prosody (%allowed-within-sentence; | %structure;)*>
<!ATTLIST prosody
     pitch      CDATA  #IMPLIED
     contour    CDATA  #IMPLIED
     range      CDATA  #IMPLIED
     rate       CDATA  #IMPLIED
     duration   CDATA  #IMPLIED
     volume     CDATA  #IMPLIED >

<!ELEMENT audio (%allowed-within-sentence; | %structure;)*>
<!ATTLIST audio
     src        %uri;                  #IMPLIED >

<!-- These basic container elements can contain any of the -->
<!-- within-sentence elements, but neither sentence or paragraph. -->

<!ELEMENT emphasis (%allowed-within-sentence;)*>
<!ATTLIST emphasis
     level      (strong|moderate|none|reduced)  'moderate' >

<!-- These basic container elements can contain only data -->

<!ENTITY % sayas-types
    "(acronym|number|ordinal|digits|telephone|date|time|
      duration|currency|measure|name|net|address)">

<!ELEMENT sayas (#PCDATA)>
<!ATTLIST sayas
     type   %sayas-types;   #REQUIRED >

<!ELEMENT phoneme (#PCDATA)>
<!ATTLIST phoneme
     ph        %phoneme-string;   #REQUIRED
     alphabet  %phoneme-alphabet; #IMPLIED >

<!-- Definitions of the basic empty elements -->

<!ELEMENT break EMPTY>
<!ATTLIST break
     size      (large|medium|small|none)  'medium'
     time      %duration;                 #IMPLIED >

<!ELEMENT mark EMPTY>
<!ATTLIST mark
     name      CDATA   #REQUIRED >

The following is a fragments of DTD that represent the elements described for Future Study

<!-- Value element -->

<!ELEMENT value EMPTY>
<!ATTLIST value
     name      CDATA   #REQUIRED >

<!-- Low-level elements -->

<!ENTITY % lowlevel-content " ph | f0 " >

<!ENTITY % pitch-types " (absolute|relative|percent) 'absolute' ">

<!ELEMENT lowlevel ( %lowlevel-content; )*>
<!ATTLIST lowlevel
     alt      CDATA              #IMPLIED
     pitch    %pitch-types;      #IMPLIED
     alphabet %phoneme-alphabet; #IMPLIED >

<!ELEMENT ph EMPTY>
<!ATTLIST ph
     p        %phoneme-alphabet;  #REQUIRED
     d        CDATA               #IMPLIED >

<!ELEMENT f0 EMPTY>
<!ATTLIST f0
     v        CDATA               #REQUIRED 
     t        CDATA               #IMPLIED >

6. Further Reading Material

The following resources are related to the Speech Synthesis Markup Language requirements and specification.

Java Speech API Markup Language

(http://java.sun.com/products/java-media/speech/forDevelopers/JSML/index.html)
JSML is an XML specification for controlling text-to-speech engines. Implementations are available from IBM, Lernout & Hauspie and in the Festival speech synthesis platform and in other implementations of the Java Speech API.

SABLE

(http://www.research.att.com/~rws/Sable.v1_0.htm)
SABLE is a markup language for controlling text to speech engines. It has evolved out of work on combining three existing text to speech languages: SSML, STML and JSML. Implementations are available for the Bell Labs synthesizer and in the Festvial speech synthesizer. The following are two of the papers written about SABLE and its applications:

SABLE: A Standard for TTS Markup, Sproat et. al. (http://www.research.att.com/~rws/SABPAP/sabpap.htm)
SABLE: an XML-based Aural Display List For The WWW, Sproat and Raman. (http://www.bell-labs.com/project/tts/csssable.html)

Spoken Text Markup Language

(http://www.cstr.ed.ac.uk/publications/1997/Sproat_1997_a.ps)
STML is an SGML language for controlling text to speech engines developed jointly by Bell Laboratories and by the Centre for Speech Technology Research, Edinburgh University.

Microsoft Speech API Control Codes

(http://www.microsoft.com/iit/)
SAPI defines a set of inline control codes for manipulating speech output by SAPI speech synthesizers.

VoiceXML Prompts

(http://www.voicexml.com/)
The Voice XML specification for dialog systems development includes a set of prompt elements for generating speech synthesis and other audio output that are very similar to elements of JSML and SABLE.

7. Acknowledgements

This document was written with the participation of the members of the W3C Voice Browser Working Group (listed in alphabetical order):

Brian Eberman, SpeechWorks
Jim Larson, Intel
Bruce Lucas, IBM
T.V. Raman, IBM
Dave Raggett, W3C/HP
Richard Sproat, AT&T
Kuansan Wang, Microsoft