Speech Synthesis Markup Language Version 1.0

W3C Working Draft 02 December 2002

This version:: http://www.w3.org/TR/2002/WD-speech-synthesis-20021202/
Latest version:: http://www.w3.org/TR/speech-synthesis/
Previous version:: http://www.w3.org/TR/2002/WD-speech-synthesis-20020405/
Editors:: Daniel C. Burnett, Nuance; Mark R. Walker, Intel; Andrew Hunt, SpeechWorks International

Abstract

The Voice Browser Working Group has sought to develop standards to enable access to the Web using spoken interaction. The Speech Synthesis Markup Language Specification is part of this set of new markup specifications for voice browsers, and is designed to provide a rich, XML-based markup language for assisting the generation of synthetic speech in Web and other applications. The essential role of the markup language is to provide authors of synthesizable content a standard way to control aspects of speech such as pronunciation, volume, pitch, rate, etc. across different synthesis-capable platforms.

Status of this Document

This is a W3C Last Call Working Draft for review by W3C Members and other interested parties. Last Call means that the Working Group believes that this specification is technically sound and therefore wishes this to be the last call for comments. If the feedback is positive, the Working Group plans to submit it for consideration as a W3C Candidate Recommendation. Comments can be sent until 15 January 2003.

Reviewers are encouraged to subscribe to the public discussion list <www-voice@w3.org> and to mail in comments as soon as possible. To subscribe, send an email to <www-voice-request@w3.org> with the word subscribe in the subject line (include the word unsubscribe to unsubscribe). A public archive is available on-line. Following the publication of a previous Last Call Working Draft of this specification, the group received a number of public comments. Those comments have not been addressed in this current document but will be addressed along with any other comments received during the review period for this document. Commenters who have sent their comments to the public mailing list need not resubmit their comments in order for them to be addressed as part of the Last Call review.

This specification describes markup for generating synthetic speech via a speech synthesizer, and forms part of the proposals for the W3C Speech Interface Framework. This document has been produced as part of the W3C Voice Browser Activity, following the procedures set out for the W3C Process. The authors of this document are members of the Voice Browser Working Group (W3C Members only). This is a Royalty Free Working Group, as described in W3C's Current Patent Practice Note. Working Group participants are required to provide patent disclosures.

Although an Implementation Report Plan has not yet been developed for this specification, the Working Group currently expects to require at least two independently developed interoperable implementations of each required feature, and at least one implementation of each feature, in order to exit the next phase of this document, the Candidate Recommendation phase. To help the Voice Browser Working Group build such a report, reviewers are encouraged to implement this specification and to indicate to W3C which features have been implemented, and any problems that arose.

Publication as a Working Draft does not imply endorsement by the W3C Membership. This is a draft document and may be updated, replaced or obsoleted by other documents at any time. It is inappropriate to cite W3C Working Drafts as other than "work in progress". A list of current W3C Recommendations and other technical reports can be found at http://www.w3.org/TR/.

0. Table of Contents

1. Introduction
2. Elements and Attributes
- 2.1 Document Structure, Text Processing and Pronunciation
  - 2.1.1 "speak" Root Element
  - 2.1.2 "xml:lang" Language Attribute
  - 2.1.3 "paragraph" and "sentence"
  - 2.1.4 "say-as" Element
  - 2.1.5 "phoneme" Element
  - 2.1.6 "sub" Element
- 2.2 Prosody and Style
  - 2.2.1 "voice" Element
  - 2.2.2 "emphasis" Element
  - 2.2.3 "break" Element
  - 2.2.4 "prosody" Element
- 2.3 Other Elements
  - 2.3.1 "audio" Element
  - 2.3.2 "mark" Element
  - 2.3.3 "desc" Element
3.SSML Documents
- 3.1 Document Form
- 3.2 Base URI
- 3.3 Pronunciation Lexicon
- 3.4 Meta data
- 3.5 Integration With Other Markup Languages
  - 3.5.1 SMIL
  - 3.5.2 ACSS
  - 3.5.3 VoiceXML
- 3.6 Fetching SSML Documents
4. Conformance
5. References
6. Acknowledgements
Appendix A. Example SSML (Informative)
Appendix B. DTD for the Speech Synthesis Markup Language (Informative)
Appendix C. Schema for the Speech Synthesis Markup Language (Normative)
Appendix D. Audio File Formats (Normative)
Appendix E. MIME Types and File Suffix (Informative)
Appendix F. Features under Consideration for Future Versions (Informative)
Appendix G. Internationalization (Normative)

1. Introduction

The W3C Standard is known as the Speech Synthesis Markup Language specification (SSML) and is based upon the JSGF and/or JSML specifications, which are owned by Sun Microsystems, Inc., California, U.S.A. The JSML specification can be found at [JSML].

SSML is part of a larger set of markup specifications for voice browsers developed through the open processes of the W3C. It is designed to provide a rich, XML-based markup language for assisting the generation of synthetic speech in Web and other applications. The essential role of the markup language is to give authors of synthesizable content a standard way to control aspects of speech output such as pronunciation, volume, pitch, rate, etc. across different synthesis-capable platforms. A related initiative to estabilish a standard system for marking up text input is SABLE [SABLE].

1.1 Vocabulary and Design Concepts

There is some variance in the use of technical vocabulary in the speech synthesis community. The following definitions establish a common understanding for this document.

Voice Browser	A device which interprets a (voice) markup language and is capable of generating voice output and/or interpreting voice input, and possibly other input/output modalities.
Speech Synthesis	The process of automatic generation of speech output from data input which may include plain text, formatted text or binary objects.
Text-To-Speech	The process of automatic generation of speech output from text or annotated text input.

The design and standardization process has followed from the Speech Synthesis Markup Requirements for Voice Markup Languages [REQS].

The following items were the key design criteria.

Consistency: provide predictable control of voice output across platforms and across speech synthesis implementations.
Interoperability: support use along with other W3C specifications including (but not limited to) VoiceXML, Aural Cascading Style Sheets and SMIL.
Generality: support speech output for a wide range of applications with varied speech content.
Internationalization: Enable speech output in a large number of languages within or across documents.
Generation and Readability: Support automatic generation and hand authoring of documents. The documents should be human-readable.
Implementable: The specification should be implementable with existing, generally available technology and the number of optional features should be minimal.

1.2 Speech Synthesis Processes

A Text-To-Speech system (a synthesis processor) that supports SSML will be responsible for rendering a document as spoken output and for using the information contained in the markup to render the document as intended by the author.

Document creation: A text document provided as input to the synthesis processor may be produced automatically, by human authoring, or through a combination of these forms. SSML defines the form of the document.

Document processing: The following are the six major processing steps undertaken by a synthesis processor to convert marked-up text input into automatically generated voice output. The markup language is designed to be sufficiently rich so as to allow control over each of the steps described below so that the document author (human or machine) can control the final voice output.

XML Parse: An XML parser is used to extract the document tree and content from the incoming text document. The structure, tags and attributes obtained in this step influence each of the following steps.
Structure analysis: The structure of a document influences the way in which a document should be read. For example, there are common speaking patterns associated with paragraphs and sentences.

- Markup support: The paragraph and sentence elements defined in SSML explicitly indicate document structures that affect the speech output.

- Non-markup behavior: In documents and parts of documents where these elements are not used, the synthesis processor is responsible for inferring the structure by automated analysis of the text, often using punctuation and other language-specific data.
Text normalization: All written languages have special constructs that require a conversion of the written form (orthographic form) into the spoken form. Text normalization is an automated process of the synthesis processor that performs this conversion. For example, for English, when "$200" appears in a document it may be spoken as "two hundred dollars". Similarly, "1/2" may be spoken as "half", "January second", "February first", "one of two" and so on.

- Markup support: The say-as element can be used in the input document to explicitly indicate the presence and type of these constructs and to resolve ambiguities. The set of constructs that can be marked has not yet been defined but might include dates, times, numbers, acronyms, currency amounts and more.

- Non-markup behavior: For text content that is not marked with the say-as element the synthesis processor is expected to make a reasonable effort to automatically locate and convert these constructs to a speakable form. Because of inherent ambiguities (such as the "1/2" example above) and because of the wide range of possible constructs in any language, this process may introduce errors in the speech output and may cause different processors to render the same document differently.
Text-to-phoneme conversion: Once the processor has determined the set of words to be spoken it must convert those words to a string of phonemes. A phoneme is the basic unit of sound in a language. Each language (and sometimes each national or dialect variant of a language) has a specific phoneme set: e.g., most US English dialects have around 45 phonemes. In many languages this conversion is ambiguous since the same written word may have many spoken forms. For example, in English, "read" may be spoken as "reed" (I will read the book) or "red" (I have read the book). Another issue is the handling of words with non-standard spellings or pronunciations. For example, an English synthesis processor will often have trouble determining how to speak some non-English-origin names; e.g. "Tlalpachicatl" which has a Mexican/Aztec origin.

- Markup support: The phoneme element allows a phonemic sequence to be provided for any word or word sequence. This provides the content creator with explicit control over pronunciations. The say-as element might also be used to indicate that text is a proper name that may allow a synthesis processor to apply special rules to determine a pronunciation. The lexicon element can be used to reference external definitions of pronunciations.

- Non-markup behavior: In the absence of a phoneme element the synthesis processor must apply automated capabilities to determine pronunciations. This is typically achieved by looking up words in a pronunciation dictionary and applying rules to determine other pronunciations. Most processors are expert at performing text-to-phoneme conversions so most words of most documents can be handled automatically.
Prosody analysis: Prosody is the set of features of speech output that includes the pitch (also called intonation or melody), the timing (or rhythm), the pausing, the speaking rate, the emphasis on words and many other features. Producing human-like prosody is important for making speech sound natural and for correctly conveying the meaning of spoken language.

- Markup support: The emphasis element, break element and prosody element may all be used by document creators to guide the synthesis processor in generating appropriate prosodic features in the speech output.

- Non-markup behavior: In the absence of these elements, synthesis processors are expert (but not perfect) in automatically generating suitable prosody. This is achieved through analysis of the document structure, sentence syntax, and other information that can be inferred from the text input.
Waveform production: The phonemes and prosodic information are used by the synthesis processor in the production of the audio waveform. There are many approaches to this processing step so there may be considerable platform-specific variation.

- Markup support: SSML does not provide explicit controls over the generation of waveforms. The voice element allows the document creator to request a particular voice or specific voice qualities (e.g. a young male voice). The audio element allows for insertion of recorded audio data into the output stream.

1.3 Document Generation, Applications and Contexts

There are many classes of document creator that will produce marked-up documents to be spoken by a synthesis processor. Not all document creators (including human and machine) have access to information that can be used in all of the elements or in each of the processing steps described in the previous section. The following are some of the common cases.

The document creator has no access to information to mark up the text. All processing steps in the synthesis processor must be performed fully automatically on raw text. The document requires only the containing speak element to indicate the content is to be spoken.
When marked text is generated programmatically the creator may have specific knowledge of the structure and/or special text constructs in some or all of the document. For example, an email reader can mark the location of the time and date of receipt of email. Such applications may use elements that affect structure, text normalization, prosody and possibly text-to-phoneme conversion.
Some document creators make considerable effort to mark as many details of the document to ensure consistent speech quality across platforms and to more precisely specify output qualities. In these cases, the markup may use any or all of the available elements to tightly control the speech output. For example, prompts generated in telephony and voice browser applications may be fine-tuned to maximize the effectiveness of the overall system.
The most advanced document creators may skip the higher-level markup (structure, text normalization, text-to-phoneme conversion, and prosody analysis) and produce low-level TTS markup for segments of documents or for entire documents. This typically requires tools to generate sequences of phonemes, plus pitch and timing information. For instance, tools that do "copy synthesis" or "prosody transplant" try to emulate human speech by copying properties from recordings.

The following are important instances of architectures or designs from which marked-up synthesis documents will be generated. The language design is intended to facilitate each of these approaches.

Dialog language: It is a requirement that it should be possible to include documents marked with SSML into the dialog description document to be produced by the Voice Browser Working Group.
Interoperability with Aural CSS : Any HTML processor that is Aural CSS-enabled can produce SSML. ACSS is covered in Section 19 of the Cascading Style Sheets, level 2 (CSS2) Specification [CSS2]. This usage of speech synthesis facilitates improved accessibility to existing HTML and XHTML content.
Application-specific style sheet processing: As mentioned above, there are classes of application that have knowledge of text content to be spoken, and this can be incorporated into the speech synthesis markup to enhance rendering of the document. In many cases, it is expected that the application will use style sheets to perform transformations of existing XML documents to speech synthesis markup. This is equivalent to the use of ACSS with HTML and once again SSML is the "final form" representation to be passed to the synthesis processor. In this context, SSML may be viewed as a superset of ACSS [CSS2, Section 19] capabilities, excepting spatial audio.

1.4 Platform-Dependent Output Behavior of Speech Synthesis Content

SSML provides a standard way to specify gross properties of synthetic speech production such as pronunciation, volume, pitch, rate and etc. Exact specification of synthetic speech output behavior across disparate processors, however, is beyond the scope of this document.

1.5 Terminology

Requirements terms: The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in [RFC2119]. However, for readability, these words do not appear in all uppercase letters in this specification.
URI: Uniform Resource Identifier: A URI is a unifying syntax for the expression of names and addresses of objects on the network as used in the World Wide Web. A URI is defined as any legal 'anyURI ' primitive as defined in XML Schema Part 2: Datatypes [SCHEMA2 §3.2.17]. The Schema definition follows [RFC2396] and [RFC2732]. Any relative URI reference must be resolved according to the rules given in Section 3.2.1. In this specification URIs are provided as attributes to elements, for example in the audio and lexicon elements.
Media Type: A media type (defined in [RFC2045] and [RFC2046]) specifies the nature of a linked resource. Media types are case insensitive. A list of registered media types is available for download [TYPES].
[See Appendix E for information on media types for SSML.]
Language identifier: A language identifier labels information content as being of a particular human language variant. Following the XML specification for language identification [XML §2.12] a legal language identifier in SSML is identified by an RFC 3066 [RFC3066] code. A language code is required by RFC 3066. A country code or other subtag identifier is optional by RFC 3066. Section 2.1.2 describes how and where the xml:lang attribute can be used to specify a language.
Error: A violation of the rules of this specification; results are undefined. A conforming processor may detect and report an error and may recover from it.
Fatal error: An error which a conforming SSML processor must detect and report to the host environment. After encountering a fatal error, the processor may continue processing the data to search for further errors and may report such errors to the application. In order to support correction of errors, the processor may make unprocessed data from the document (with intermingled character data and markup) available to the application. Once a fatal error is detected, however, the processor must not continue normal processing (i.e., it must not continue to create audio or other output).
At user option: A conforming processor may or must (depending on the modal verb in the sentence) behave as described; if it does, it must provide users a means to enable or disable the behavior described.

2. Elements and Attributes

The following elements are defined in this specification.

2.1 Document Structure, Text Processing and Pronunciation
- 2.1.1 "speak" Root Element
- 2.1.2 "xml:lang" Language Attribute
- 2.1.3 "paragraph" and "sentence"
- 2.1.4 "say-as" Element
- 2.1.5 "phoneme" Element
- 2.1.6 "sub" Element
2.2 Prosody and Style
- 2.2.1 "voice" Element
- 2.2.2 "emphasis" Element
- 2.2.3 "break" Element
- 2.2.4 "prosody" Element
2.3 Other Elements
- 2.3.1 "audio" Element
- 2.3.2 "mark" Element
- 2.3.3 "desc" Element

2.1 Document Structure, Text Processing and Pronunciation

2.1.1 speak Root Element

The Speech Synthesis Markup Language is an XML application. The root element is speak. xml:lang is a defined attribute specifying the language of the root document. xml:base is a defined attribute specifying the Base URI of the root document. The version attribute is a required attribute that indicates the version of the specification to be used for the document. The version number for this specification is 1.0.

<?xml version="1.0" encoding="ISO-8859-1?>
<speak version="1.0"
         xmlns="http://www.w3.org/2001/10/synthesis"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                   http://www.w3.org/TR/speech-synthesis/synthesis.xsd"
         xml:lang="en-US">
  ... the body ...
</speak>

The following elements can occur within the content of the speak element: audio, break, emphasis, lexicon, mark, metadata, p, paragraph, phoneme, prosody, say-as, sub, s, sentence, voice.

2.1.2 `xml:lang` Attribute: Language

Following the XML 1.0 [XML §2.12], languages are indicated by an xml:lang attribute on the enclosing element with the value being a language identifier.

Language information is inherited down the document hierarchy, i.e. it has to be given only once if the whole document is in one language, and language information nests, i.e. inner attributes overwrite outer attributes.

xml:lang is a defined attribute for the voice, speak, paragraph, sentence, p, and s elements.

<?xml version="1.0" encoding="ISO-8859-1"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                   http://www.w3.org/TR/speech-synthesis/synthesis.xsd"
         xml:lang="en-US">
  <paragraph>I don't speak Japanese.</paragraph>
  <paragraph xml:lang="ja">Nihongo-ga wakarimasen.</paragraph>
</speak>

The speech synthesis processor largely determines behavior in the case that a document requires speech output in a language not supported by the processor. Specifying xml:lang does not imply a change in voice, though this may indeed occur. When a given voice is unable to speak content in the indicated language, a new voice may be selected by the processor. No change in the voice or prosody should occur if the xml:lang value is the same as the inherited value. Further information about voice selection appears in Section 2.2.1.

There may be variation across conformant processors in the implementation of xml:lang for different markup elements (e.g. paragraph and sentence elements).

All elements should process their contents specific to the enclosing language. For instance, the phoneme, emphasis, break, paragraph and sentence elements should each be rendered in a manner that is appropriate to the current language.

The text normalization processing step may be affected by the enclosing language. This is true for both markup support by the say-as element and non-markup behaviour. In the following example the same text "2/1/2000" may be read as "February first two thousand" in the first sentence, following American English pronunciation rules, but as "the second of January two thousand" in the second one, which follows Italian preprocessing rules.

<?xml version="1.0" encoding="ISO-8859-1"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                   http://www.w3.org/TR/speech-synthesis/synthesis.xsd"
         xml:lang="en-US">
  <sentence>Today, 2/1/2000.</sentence>
  <!-- Today, February first two thousand -->
  <sentence xml:lang="it">Un mese fà, 2/1/2000.</sentence>
  <!-- Un mese fà, il due gennaio duemila -->
  <!-- One month ago, the second of January two thousand -->
</speak>

2.1.3 paragraph and sentence: Text Structure Elements

A paragraph element represents the paragraph structure in text. A sentence element represents the sentence structure in text. For brevity, the markup also supports p and s as exact equivalents of paragraph and sentence.

xml:lang is a defined attribute on the paragraph, sentence, p and s elements.

<?xml version="1.0" encoding="ISO-8859-1"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                   http://www.w3.org/TR/speech-synthesis/synthesis.xsd"
         xml:lang="en-US">
  <paragraph>
    <sentence>This is the first sentence of the paragraph.</sentence>
    <sentence>Here's another sentence.</sentence>
  </paragraph>
</speak>

The use of paragraph and sentence elements is optional. Where text occurs without an enclosing paragraph or sentence elements the speech synthesis processor should attempt to determine the structure using language-specific knowledge of the format of plain text.

The following elements can occur within the content of the paragraph or p elements: audio, break, emphasis, mark, phoneme, prosody, say-as, sub, s, sentence, voice.

The following elements can occur within the content of the sentence or s elements: audio, break, emphasis, mark, phoneme, prosody, say-as, sub, voice.

2.1.4 say-as Element

The say-as element allows the author to indicate information on the type of text construct contained within the element and to help specify the level of detail for rendering the contained text.

Defining a comprehensive set of text format types is difficult because of the variety of languages that must be considered and because of the innate flexibility of written languages. SSML only specifies the say-as element, its attributes, and their purpose. It does not enumerate the possible values for the attributes. The Working Group expects to produce a separate document that will define standard values and associated normative behavior for these values. Examples given here are only for illustrating the purpose of the element and the attributes.

The say-as element has three attributes: interpret-as, format, and detail. The interpret-as attribute is always required; the other two attributes are optional. The legal values for the format attribute depend on the value of the interpret-as attribute.

The following elements can occur within the content of the say-as element: none.

The `interpret-as` and `format` attributes

The interpret-as attribute indicates the content type of the contained text construct. Specifying the content type helps the SSML processor to distinguish and interpret text constructs that may be rendered in different ways depending on what type of information is intended. In addition, the optional format attribute can give further hints on the precise formatting of the contained text for content types that may have ambiguous formats.

When specified, the interpret-as and format values are to be interpreted by the SSML processor as hints provided by the markup document author to aid text normalization and pronunciation.

In all cases, the text enclosed by any say-as element is intended to be a standard, orthographic form of the language currently in context. An SSML processor should be able to support the common, orthographic forms of the specified language for every content type that it supports.

When the value for the interpret-as attribute is unknown or unsupported by a processor, it must render the contained text as if no interpret-as value were specified.

When the value for the format attribute is unknown or unsupported by a processor, it must render the contained text as if no format value were specified, and should render it using the interpret-as value that is specified.

When the content of the element does not match the content type and/or format specified, an SSML processor should proceed and attempt to render the information. When the content of the element contains other text in addition to the indicated content type, the SSML processor must attempt to render such text.

Indicating the content type or format does not necessarily affect the way the information is pronounced. An SSML processor should pronounce the contained text in a manner in which such content is normally produced for the locale.

Example values for the interpret-as and format attributes: (please note that these values are just for illustration; they are not suggested or endorsed values)

interpret-as	format	interpretation	Examples
`number`	`ordinal`	interpret the content as an ordinal number	<say-as interpret-as="number" format="ordinal">5</say-as> : fifth
`number`	`cardinal`	interpret the content as a cardinal number	<say-as interpret-as="number" format="cardinal">VII</say-as> : seven
`number`	`telephone`	interpret the content as a telephone number	<say-as interpret-as="number" format="telephone">123-456-7890</say-as>
`date`	`mdy`	interpret the content as a date in month-day-year format	<say-as interpret-as="date" format="mdy">5/12/2003</say-as> : May twelfth, two thousand three
`digits`		interpret the content as digits	<say-as interpret-as="digits"> 123 < /say-as> : one two three
`ordinal`		interpret the content as an ordinal number	<say-as interpret-as="ordinal"> 123 < /say-as> : one hundred and twenty third
`cardinal`		interpret the content as an cardinal number	<say-as interpret-as="cardinal"> 123 < /say-as> : one hundred and twenty three
`letters`		interpret the content as letters	<say-as interpret-as="letters"> W3C < /say-as> : double-u three cee
`words`		interpret the content as words	<say-as interpret-as="words"> ASCII </say-as> : askie

The `detail` attribute

The detail attribute is an optional attribute that indicates the level of detail to be read aloud or rendered. Every value of the detail attribute must render all of the informational content in the contained text; however, specific values for the detail attribute can be used to render content that is not usually informational in running text but may be important to render for specific purposes. For example, an SSML processor will usually render punctuations through appropriate changes in prosody. Setting a higher level of detail may be used to speak punctuations explicitly, e.g. for reading out coded part numbers or pieces of software code.

The detail attribute can be used for all say-as content types.

If the detail attribute is not specified, the level of detail that is produced by the SSML processor depends on the text content and the locale.

When the value for the detail attribute is unknown or unsupported by a processor, it must render the contained text as if no value were specified for the detail attribute.

Example values for the detail attribute: (please note that these values are just for illustration; they are not suggested or endorsed values)

interpret-as	format	detail	interpretation	Examples
		`dictate`	dictate the text	<say-as interpret-as="" detail="dictate">It's simple, isn't it?</say-as> : It's simple comma isn't it question mark
`letters`		`strict`	speak letters with all detail	<say-as interpret-as="letters" detail="strict">X4:5à-bB2</say-as> : capital X four colon five A with grave accent dash B capital B two
`number`	`telephone`	`punctuation`	speak the punctuation marks given in the telephone number	<say-as interpret-as="number" format="telephone" detail="punctuation">09/123.45.67</say-as> : zero nine slash one hundred twenty-three dot forty-five dot sixty-seven

2.1.5 phoneme Element

The phoneme element provides a phonetic pronunciation for the contained text. The phoneme element may be empty. However, it is recommended that the element contain human-readable text that can be used for non-spoken rendering of the document. For example, the content may be displayed visually for users with hearing impairments.

The ph attribute is a required attribute that specifies the phoneme string.

The alphabet attribute is an optional attribute that specifies the phonetic alphabet. SSML processors should support a value for alphabet of "ipa", corresponding to characters composing the International Phonetic Alphabet [IPA]. In addition to an exhaustive set of vowel and consonant symbols, IPA supports a syllable delimiter, numerous diacritics, stress symbols, lexical tone symbols, intonational markers and more.

<?xml version="1.0" encoding="ISO-8859-1"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                   http://www.w3.org/TR/speech-synthesis/synthesis.xsd"
         xml:lang="en-US">
  <phoneme alphabet="ipa" ph="t&#x252;m&#x251;to&#x28A;"> tomato </phoneme>
  <!-- This is an example of IPA using character entities -->
</speak>

It is an error if a value for alphabet is specified that is not known or cannot be applied by an SSML processor.

The following elements can occur within the content of the phoneme element: none.

2.1.6 sub Element

The sub element is employed to indicate that the specified text replaces the contained text for pronunciation. This allows a document to contain both a spoken and written form. The required alias attribute specifies the string to be substituted for the enclosed string. The processor should apply text normalization to the alias value.

The following elements can occur within the content of the sub element: none.

<?xml version="1.0" encoding="ISO-8859-1"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                   http://www.w3.org/TR/speech-synthesis/synthesis.xsd"
         xml:lang="en-US">
  <sub alias="World Wide Web Consortium"> W3C </sub>
  <!-- World Wide Web Consortium -->
</speak>

2.2 Prosody and Style

2.2.1 voice Element

The voice element is a production element that requests a change in speaking voice. Attributes are:

xml:lang : optional language specification attribute.
gender : optional attribute indicating the preferred gender of the voice to speak the contained text. Enumerated values are: "male", "female", "neutral".
age : optional attribute indicating the preferred age of the voice to speak the contained text. Acceptable values are of type (integer)
variant : optional attribute indicating a preferred variant of the other voice characteristics to speak the contained text. (e.g. the second or next male child voice). Valid values of variant are integers.
name : optional attribute indicating a platform-specific voice name to speak the contained text. The value may be a space-separated list of names ordered from top preference down. As a result a name must not contain any whitespace.

<?xml version="1.0" encoding="ISO-8859-1"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                   http://www.w3.org/TR/speech-synthesis/synthesis.xsd"
         xml:lang="en-US">   
  <voice gender="female">Mary had a little lamb,</voice>
  <!-- now request a different female child's voice -->
  <voice gender="female" variant="2">
  It's fleece was white as snow.
  </voice>
  <!-- platform-specific voice selection -->
  <voice name="Mike">I want to be like Mike.</voice>
</speak>

When there is not a voice available that exactly matches the attributes specified in the document, or multiple voices that match the criteria, the voice selection algorithm may be processor-specific.
If a voice is available for a requested xml:lang, an SSML processor must use it. If there are multiple such voices available, the processor should use the voice that best matches the specified values for name, variant, gender and age.
If there is no voice available for the requested language, the processor should select a voice that is closest to the requested language (e.g. same language but different region). If there are multiple such voices available, the processor should use a voice that best matches the specified values for name, variant, gender and age.
It is an error if the processor decides it does not have a voice that sufficiently matches the above criteria.

Note: The group is considering adding more explicit control over voice selection in a future version of the SSML Specification.

voice attributes are inherited down the tree including to within elements that change the language.

<?xml version="1.0" encoding="ISO-8859-1"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                   http://www.w3.org/TR/speech-synthesis/synthesis.xsd"
         xml:lang="en-US">
  <voice gender="female"> 
    Any female voice here.
    <voice age="6"> 
      A female child voice here.
      <paragraph xml:lang="ja"> 
        <!-- A female child voice in Japanese. -->
      </paragraph>
    </voice>
  </voice>
</speak>

A change in voice resets the prosodic parameters since different voices have different natural pitch and speaking rates. Volume is the only exception.

The quality of the output audio or voice may suffer if a change in voice is requested within a sentence.

The following elements can occur within the content of the voice element: audio, break, emphasis, mark, p, paragraph, phoneme, prosody, say-as, sub, s, sentence, voice.

2.2.2 emphasis Element

The emphasis element requests that the contained text be spoken with emphasis (also referred to as prominence or stress). The synthesis processor determines how to render emphasis since the nature of emphasis differs between languages, dialects or even voices. The attributes are:

level : the optional level attribute indicates the strength of emphasis to be applied. Defined values are "strong", "moderate", "none" and "reduced". The default level is "moderate". The meaning of "strong" and "moderate" emphasis is interpreted according to the language being spoken (languages indicate emphasis using a possible combination of pitch change, timing changes, loudness and other acoustic differences). The "reduced" level is effectively the opposite of emphasizing a word. For example, when the phrase "going to" is reduced it may be spoken as "gonna". The "none" level is used to prevent the speech synthesis processor from emphasizing words that it might typically emphasize.

<?xml version="1.0" encoding="ISO-8859-1"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                   http://www.w3.org/TR/speech-synthesis/synthesis.xsd"
         xml:lang="en-US">
  That is a <emphasis> big </emphasis> car!
  That is a <emphasis level="strong"> huge </emphasis>
  bank account!
</speak>

The following elements can occur within the content of the emphasis element: audio, break, emphasis, mark, phoneme, prosody, say-as, sub, voice.

2.2.3 break Element

The break element is an empty element that controls the pausing or other prosodic boundaries between words. The use of the break element between any pair of words is optional. If the element is not present between words, the speech synthesis processor is expected to automatically determine a break based on the linguistic context. In practice, the break element is most often used to override the typical automatic behavior of a synthesis processor. The attribute is:

time : the time attribute is an optional attribute indicating the duration of a pause. Legal values are: durations in seconds or milliseconds, "none", "x-small", "small", "medium" (default value), "large", or "x-large". Durations follow the "Times" attribute format from the [CSS2] specification. e.g. "250ms", "3s". The value "none" indicates that a normal break boundary should be used. The other five values indicate increasingly larger break boundaries between words. The larger boundaries are typically accompanied by pauses.

<?xml version="1.0" encoding="ISO-8859-1"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                   http://www.w3.org/TR/speech-synthesis/synthesis.xsd"
         xml:lang="en-US">
  Take a deep breath <break/>
  then continue. 
  Press 1 or wait for the tone. <break time="3s"/>
  I didn't hear you!
</speak>

The following elements can occur within the content of the break element: none.

2.2.4 prosody Element

The prosody element permits control of the pitch, speaking rate and volume of the speech output. The attributes, all optional, are:

pitch : the baseline pitch for the contained text. Legal values are: a number followed by "Hz", a relative change or "x-high", "high", "medium", "low", "x-low", or "default".
contour : sets the actual pitch contour for the contained text. The format is specified in Pitch contour below.
range : the pitch range (variability) for the contained text. Legal values are: a number followed by "Hz", a relative change or "x-high", "high", "medium", "low", "x-low", or "default".
rate : the speaking rate in words-per-minute for the contained text. Legal values are: a number, a relative change or "x-fast", "fast", "medium", "slow", "x-slow", or "default".
duration : a value in seconds or milliseconds for the desired time to take to read the element contents. Follows the Times attribute format from the [CSS2] specification. e.g. "250ms", "3s".
volume : the volume for the contained text in the range 0.0 to 100.0 (higher values are louder and specifying a value of zero is equivalent to specifying "silent"). Legal values are: number, a relative change or "silent", "x-soft", "soft", "medium", "loud", "x-loud", or "default". The volume scale is linear amplitude. The default is 100.0.

Number

A number is a simple floating point value without exponentials. Legal formats are "n", "n.", ".n" and "n.n" where "n" is a sequence of one or more digits.

Relative values

Relative changes for the attributes above can be specified

as a percent (a number optionally preceded by "+" or "-" and followed by "%"), e.g. "3%", "+15.2%", "-8.0%", or
as a relative number:
- For the rate and volume attributes, relative changes are a number preceded by "+" or "-", e.g. "+10", "-5.5".
- For the pitch and range attributes, relative changes can be given in semitones (a number preceded by "+" or "-" and followed by "st") or in Hertz (a number preceded by "+" or "-" and followed by "Hz"): "+0.5st", "+5st", "-2st", "+10Hz", "-5.5Hz".

<?xml version="1.0" encoding="ISO-8859-1"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                   http://www.w3.org/TR/speech-synthesis/synthesis.xsd"
         xml:lang="en-US">
  The price of XYZ is <prosody rate="-10%">$45</prosody>
</speak>

Pitch contour

The pitch contour is defined as a set of targets at specified time positions in the speech output. The algorithm for interpolating between the targets is processor-specific. In each pair of the form (time position, target), the first value is a percentage of the period of the contained text (a number followed by "%") and the second value is the value of the pitch attribute (a number followed by "Hz", a relative change, or descriptive values are all permitted). Time position values outside 0% to 100% are ignored. If a pitch value is not defined for 0% or 100% then the nearest pitch target is copied. All relative values for the pitch are relative to the pitch value just before the contained text.

<?xml version="1.0" encoding="ISO-8859-1"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                   http://www.w3.org/TR/speech-synthesis/synthesis.xsd"
         xml:lang="en-US">
  <prosody contour="(0%,+20)(10%,+30%)(40%,+10)">
    good morning
  </prosody>
</speak>

The duration attribute takes precedence over the rate attribute. The contour attribute takes precedence over the pitch and range attributes.

The default value of all prosodic attributes is no change. For example, omitting the rate attribute means that the rate is the same within the element as outside.

The following elements can occur within the content of the prosody element: audio, break, emphasis, mark, p, paragraph, phoneme, prosody, say-as, sub, s, sentence, voice.

Limitations

All prosodic attribute values are indicative. If a speech synthesis processor is unable to accurately render a document as specified, (e.g. trying to set the pitch to 1Mhz, or the speaking rate to 1,000,000 words per minute.) it must make a best effort to continue processing by imposing a limit or a substitute for the specified, unsupported value and may inform the host environment when such limits are exceeded.

In some cases, SSML processors may elect to ignore a given prosodic markup if the processor determines, for example, that the indicated value is redundant, improper or in error. In particular, concatenative-type synthetic speech systems that employ large acoustic units may reject prosody-modifying markup elements if they are redundant with the prosody of a given acoustic unit(s) or would otherwise result in degraded speech quality.

2.3 Other Elements

2.3.1 audio Element

The audio element supports the insertion of recorded audio files (see Appendix D for required formats) and the insertion of other audio formats in conjunction with synthesized speech output. The audio element may be empty. If the audio element is not empty then the contents should be the marked-up text to be spoken if the audio document is not available. The alternate content may include text, speech markup, desc elements, or other audio elements. The alternate content may also be used when rendering the document to non-audible output and for accessibility (see the desc element). The required attribute is src, which is the URI of a document with an appropriate MIME type.

<?xml version="1.0" encoding="ISO-8859-1"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                   http://www.w3.org/TR/speech-synthesis/synthesis.xsd"
         xml:lang="en-US">
                 
  <!-- Empty element -->
  Please say your name after the tone.  <audio src="beep.wav"/>

  <!-- Container element with alternative text -->
  <audio src="prompt.au">What city do you want to fly from?</audio>
  <audio src="welcome.wav">  
    <emphasis>Welcome</emphasis>  to the Voice Portal. 
  </audio>

</speak>

An audio element is sucessfully rendered:

If the referenced audio source is played, or
If the processor is unable to execute #1 but the alternative content is successfully rendered, or
If the processor can detect that text-only output is required and the alternative content is successfully rendered.

Deciding which conditions result in the alternative content being rendered is processor-dependent. If the audio element is not successfully rendered, an SSML processor should continue processing and should notify the hosting environment. A processor may determine after beginning playback of an audio source that the audio cannot be played in its entirety. For example, encoding problems, network disruptions, etc. may occur. The processor may designate this either as successful or unsuccessful rendering, but it must document this behavior.

The following elements can occur within the content of the audio element: audio, break, desc, emphasis, mark, p, paragraph, phoneme, prosody, say-as, sub, s, sentence, voice.

2.3.2 mark Element

A mark element is an empty element that places a marker into the text/tag sequence. The mark element can be used to reference a specific location in the text/tag sequence, and can additionally be used to insert a marker into an output stream for asynchronous notification. When processing a mark element, an SSML processor must do one or both of the following:

inform the hosting environment with the value of the name attribute and with information allowing the platform to retrieve the corresponding position in the rendered output.
when audio output of the SSML document reaches the mark, issue an event that includes the required name attribute of the element. The hosting environment defines the destination of the event.

The mark element does not affect the speech output process.

<?xml version="1.0" encoding="ISO-8859-1"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                   http://www.w3.org/TR/speech-synthesis/synthesis.xsd"
         xml:lang="en-US">
                 
  Go from <mark name="here"/> here, to <mark name="there"/> there!

</speak>

The following elements can occur within the content of the mark element: none.

2.3.3 desc Element

The desc element can only occur within the content of the audio element. When the audio source referenced in audio is not speech, e.g. audio wallpaper or sonicon punctuation, it should contain a desc element whose textual content is a description of the audio source (e.g. "door slamming"). If text-only output is being produced by the SSML processor, the content of the desc element(s) should be rendered instead of other alternative content in audio.

The following elements can occur within the content of the desc element: none.

3. SSML Documents

3.1 Document Form

A legal stand-alone Speech Synthesis Markup Language document must have a legal XML Prolog [XML §2.8].

The XML prolog in a synthesis document comprises the XML declaration and an optional DOCTYPE declaration referencing the synthesis DTD. It is followed by the root speak element. The XML prolog may also contain XML comments, processor instructions and other content permitted by XML in a prolog.

The version number of the XML declaration indicates which version of XML is being used. The version number of the speak element indicates which version of the SSML specification is being used -- "1.0" for this specification. The speak version is a required attribute.

The speak element must designate the SSML namespace using the xmlns attribute [XMLNS]. The namespace for SSML is defined to be http://www.w3.org/2001/10/synthesis.

It is recommended that the speak element also include xmlns:xsi and xsi:schemaLocation attributes to indicate the location of the SSML schema (see Appendix C):

If present, the optional DOCTYPE must reference the standard DOCTYPE and identifier.

The following are two examples of legal SSML headers:

<?xml version="1.0" encoding="ISO-8859-1"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                   http://www.w3.org/TR/speech-synthesis/synthesis.xsd"
         xml:lang="en">

<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE speak PUBLIC "-//W3C//DTD SYNTHESIS 1.0//EN"
                  "http://www.w3.org/TR/speech-synthesis/synthesis.dtd">
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
         xml:lang="en">

The language for the document is defined by the xml:lang attribute on the speak element. See Section 2.1.2 for details.

The base URI for the document is defined by the xml:base attribute on the speak element. See Section 3.2 for details.

The metadata and lexicon elements must occur before all other elements and text contained within the root speak element. There are no other ordering constraints on the elements in this specification.

3.2 Base URI

Relative URIs are resolved according to a base URI, which may come from a variety of sources. The base URI declaration allows authors to specify a document's base URI explicitly. See Section 3.2.1 for details on the resolution of relative URIs.

The path information specified by the base URI declaration only affects URIs in the document where the element appears.

The base URI declaration is permitted but optional. The two elements affected by it are

audio

The optional src attribute can specify a relative URI.

lexicon

The uri attribute can specify a relative URI.

The xml:base attribute

The base URI declaration follows [XML-BASE] and is indicated by an xml:base attribute on the root speak element.

<?xml version="1.0" encoding="ISO-8859-1"?>
<speak version="1.0" xml:lang="en-US"
         xmlns="http://www.w3.org/2001/10/synthesis"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                   http://www.w3.org/TR/speech-synthesis/synthesis.xsd"
         xml:base="http://www.example.com/base-file-path">

<?xml version="1.0" encoding="ISO-8859-1"?>
<speak version="1.0" xml:lang="en-US"
         xmlns="http://www.w3.org/2001/10/synthesis"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                   http://www.w3.org/TR/speech-synthesis/synthesis.xsd"
         xml:base="http://www.example.com/another-base-file-path">

3.2.1 Resolving Relative URIs

User agents must calculate the base URI for resolving relative URIs according to [RFC2396]. The following describes how [RFC2396] applies to synthesis documents.

User agents must calculate the base URI according to the following precedences (highest priority to lowest):

The base URI is set by the xml:base attribute on the speak element (see Section 3.2).
The base URI is given by meta data discovered during a protocol interaction, such as an HTTP header (see [RFC2616]).
By default, the base URI is that of the current document. Not all synthesis documents have a base URI (e.g., a valid synthesis document may appear in an email and may not be designated by a URI). Such synthesis documents are not valid if they contain relative URIs and rely on a default base URI.

3.3 Pronunciation Lexicon

A synthesis document may reference one or more external pronunciation lexicon documents. A lexicon document is identified by a URI with an optional media type.

The pronunciation information contained within a lexicon document is used only for words defined within the enclosing document.

The W3C Voice Browser Working Group is developing the Pronunciation Lexicon Markup Language [LEX]. The specification will address the matching process between words and lexicon entries and the mechanism by which a speech synthesis processor handles multiple pronunciations from internal and synthesis-specified lexicons. Pronunciation handling with proprietary lexicon formats will necessarily be specific to the synthesis processor.

Pronunciation lexicons are necessarily language-specific. Pronunciation lookup in a lexicon and pronunciation inference for any word may use an algorithm that is language-specific.

The lexicon element

Any number of lexicon elements may occur as immediate children of the speak element. The lexicon element must have a uri attribute specifying a URI that identifies the location of the pronunciation lexicon document.

Issue: There has been some discussion as to whether the lexicon element should be permitted to occur within the content of elements other than speak. Reviewers are especially encouraged to provide feedback on this point.

The lexicon element may have a type attribute that specifies the media type of the pronunciation lexicon document.

<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE speak PUBLIC "-//W3C//DTD SYNTHESIS 1.0//EN"
                  "http://www.w3.org/TR/speech-synthesis/synthesis.dtd">
<speak xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                 http://www.w3.org/TR/speech-synthesis/synthesis.xsd"
       xml:lang="en">

  <lexicon uri="http://www.example.com/lexicon.file"/>
  <lexicon uri="http://www.example.com/strange-words.file"           type="media-type"/>
  ...
</speak>

The following elements can occur within the content of the lexicon element: none.

3.4 Meta data

The metadata element is a container in which information about the document can be placed using a metadata schema. Although any metadata schema can be used with metadata, it is recommended that the Resource Description Format (RDF) schema [RDF-SCHEMA] be used in conjunction with the general metadata properties defined in the Dublin Core Metadata Initiative [DC].

RDF is a declarative language and provides a standard way for using XML to represent metadata in the form of statements about properties and relationships of items on the Web. Content creators should refer to W3C metadata Recommendations [RDF-SYNTAX] and [RDF-SCHEMA] when deciding which metadata RDF schema to use in their documents. Content creators should also refer to the Dublin Core Metadata Initiative [DC], which is a set of generally applicable core metadata properties (e.g., Title, Creator, Subject, Description, Copyrights, etc.).

Document properties declared with the metadata element can use any metadata schema.

Informative: This is an example of how metadata can be included in a speech synthesis document using the Dublin Core version 1.0 RDF schema [DC] describing general document information such as title, description, date, and so on:

<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE speak PUBLIC "-//W3C//DTD SYNTHESIS 1.0//EN"
                  "http://www.w3.org/TR/speech-synthesis/synthesis.dtd">
<speak xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                 http://www.w3.org/TR/speech-synthesis/synthesis.xsd"
       xml:lang="en">
    
  <metadata>
   <rdf:RDF
       xmlns:rdf = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
       xmlns:rdfs = "http://www.w3.org/TR/1999/PR-rdf-schema-19990303#"
       xmlns:dc = "http://purl.org/metadata/dublin_core#">

   <!-- Metadata about the synthesis document -->
   <rdf:Description about="http://www.example.com/meta.ssml"
       dc:Title="Hamlet-like Soliloquy"
       dc:Description="Aldine's Soliloquy in the style of Hamlet"
       dc:Publisher="W3C"
       dc:Language="en"
       dc:Date="2002-11-29"
       dc:Rights="Copyright 2002 Aldine Turnbet"
       dc:Format="application/ssml+xml" >                
       <dc:Creator>
          <rdf:Seq ID="CreatorsAlphabeticalBySurname">
             <rdf:li>William Shakespeare</rdf:li>
             <rdf:li>Aldine Turnbet</rdf:li>
          </rdf:Seq>
       </dc:Creator>
   </rdf:Description>
  </rdf:RDF>
 </metadata>

</speak>

The following SSML elements can occur within the content of the metadata element: none.

3.5 Integration With Other Markup Languages

3.5.1 SMIL

The Synchronized Multimedia Integration Language (SMIL, pronounced "smile") [SMIL] enables simple authoring of interactive audiovisual presentations. SMIL is typically used for "rich media"/multimedia presentations which integrate streaming audio and video with images, text or any other media type. SMIL is an easy-to-learn HTML-like language, and many SMIL presentations are written using a simple text-editor. See the SMIL/SSML integration examples in Appendix A.

3.5.2 ACSS

Aural style sheets [CSS2, Section 19] are employed to augment standard visual forms of documents (like HTML) with additional elements that assist in the synthesis of the text into audio. In comparison to SSML, ACSS-generated documents are capable of more complex specifications of the audio sequence, including the designation of 3D location of the audio source. Many of the other ACSS elements overlap SSML functionality, especially in the specification of voice type/quality. SSML may be viewed as a superset of ACSS capabilities, excepting spatial audio.

3.5.3 VoiceXML

The Voice Extensible Markup Language [VoiceXML] enables Web-based development and content-delivery for interactive voice response applications. VoiceXML supports speech synthesis, recording and playback of digitized audio, speech recognition, DTMF input, telephony call control, and form-driven mixed initiative dialogs. VoiceXML 2.0 extends SSML for the markup of text to be synthesized. For an example of the integration between VoiceXML and SSML see [Appendix A].

3.6 SSML Document Fetching

The fetching and caching behavior of SSML documents is defined by the environment in which the SSML processor operates. In a VoiceXML interpreter context for example, the caching policy is determined by the VoiceXML interpreter.

4. Conformance

4.1 Conforming Speech Synthesis Markup Language Fragments

A synthesis document fragment is a Conforming Speech Synthesis Markup Language Fragment if:

it is a well-formed XML document [XML] conforming to namespaces in XML [XMLNS]
and it conforms to the criteria for Conforming Stand-Alone Speech Synthesis Markup Language Documents after:
- all non-synthesis namespace elements and attributes and all xmlns attributes which refer to non-synthesis namespace elements are removed from the document,
- and, an appropriate XML declaration (i.e., <?xml...?>) is included at the top of the document,
- and, if the speak element does not already designate the synthesis namespace using the xmlns attribute, then xmlns="http://www.w3.org/2001/10/synthesis" is added to the element.

4.2 Conforming Stand-Alone Speech Synthesis Markup Language Documents

A document is a Conforming Stand-Alone Speech Synthesis Markup Language Document if it meets both the following conditions:

It is a well-formed XML document [XML] conforming to namespaces in XML [XMLNS].
It is a valid XML document which adheres to the specification described in this document (Speech Synthesis Markup Language Specification) including the constraints expressed in the Schema (see Appendix C) and having an XML Prolog and speak root element as specified in Section 3.1.

The SSML specification and these conformance criteria provide no designated size limits on any aspect of synthesis documents. There are no maximum values on the number of elements, the amount of character data, or the number of characters in attribute values.

4.3 Using SSML with other Namespaces

The synthesis namespace may be used with other XML namespaces as per the Namespaces in XML Recommendation [XMLNS]. Future work by W3C will address ways to specify conformance for documents involving multiple namespaces.

4.4 Conforming Speech Synthesis Markup Language Processors

A Speech Synthesis Markup Language processor is a program that can parse and process Speech Synthesis Markup Language documents.

In a Conforming Speech Synthesis Markup Language Processor, the XML parser must be able to parse and process all XML constructs defined by XML 1.0 [XML] and Namespaces in XML [XMLNS]. This XML parser is not required to perform validation of a Speech Synthesis Markup Language document as per its schema or DTD; this implies that during processing of a Speech Synthesis Markup Language document it is optional to apply or expand external entity references defined in an external DTD.

A Conforming Speech Synthesis Markup Language Processor must correctly understand and apply the semantics of each markup element as described by this document.

A Conforming Speech Synthesis Markup Language Processor must meet the following requirements for handling of languages:

A Conforming Speech Synthesis Markup Language Processor is required to parse all legal language declarations successfully.
A Conforming Speech Synthesis Markup Language Processor may be able to apply the semantics of markup languages which refer to more than one language. When a processor is able to support each language in the set but is unable to handle them concurrently it should inform the hosting environment. When the set includes one or more languages that are not supported by the processor it should inform the hosting environment.
A Conforming Speech Synthesis Markup Language Processor may implement languages by approximate substitutions according to a documented, processor-specific behavior. For example, using a US English synthesis processor to process British English input.

When a Conforming Speech Synthesis Markup Language Processor encounters elements or attributes in a non-synthesis namespace it may:

ignore the non-standard elements and/or attributes
or, process the non-standard elements and/or attributes
or, reject the document containing those elements and/or attributes

There is, however, no conformance requirement with respect to performance characteristics of the Speech Synthesis Markup Language Processor. For instance, no statement is required regarding the accuracy, speed or other characteristics of speech produced by the processor. No statement is made regarding the size of input that a Speech Synthesis Markup Language Processor must support.

4.5 Conforming User Agent

A Conforming User Agent is a Conforming Speech Synthesis Markup Language Processor that is capable of accepting a Speech Synthesis Markup Language document as input and producing a spoken output by using the information contained in the markup to render the document as intended by the author.

Since the output cannot be guaranteed to be a correct representation of all the markup contained in the input there is no conformance requirement regarding accuracy. A conformance test may, however, require some examples of correct synthesis of a reference document to determine conformance.

5. References

5.1 Normative References

[CSS2]: World Wide Web Consortium, Cascading Style Sheets, level 2 CSS2 Specification. W3C Recommendation. See http://www.w3.org/TR/REC-CSS2/
[IPA]: International Phonetic Association. International Phonetic Alphabet. Department of Linguistics, University of Victoria, Victoria, British Columbia, Canada, 1996. See http://www2.arts.gla.ac.uk/IPA/fullchart.html
[RFC1521]: N. Borenstein and N. Freed, MIME (Multipurpose Internet Mail Extensions) Part One: Mechanisms for Specifying and Describing the Format of Internet Message Bodies, September 1993. See http://www.ietf.org/rfc/rfc1521.txt
[RFC2045]: N. Freed and N. Borenstein Multipurpose Internet Mail Extensions (MIME) Part One: Format of Internet Message Bodies. IETF RFC 2045. November, 1996. See http://www.ietf.org/rfc/rfc2045.txt
[RFC2046]: N. Freed and N. Borenstein Multipurpose Internet Mail Extensions (MIME) Part Two: Media Types IETF RFC 2046. November, 1996. See http://www.ietf.org/rfc/rfc2046.txt
[RFC2119]: S. Bradner, Key words for use in RFCs to Indicate Requirement Levels, Harvard University, March 1997. See http://www.normos.org/ietf/rfc/rfc2119.txt
[RFC2396]: T. Berners-Lee, R. Fielding, U.C. Irvine, L. Masinter. Uniform Resource Identifiers (URI): Generic Syntax IETF RFC 2396. 1998. See http://www.ietf.org/rfc/rfc2396.txt
[RFC3066]: H. Alvestrand, Tags for the Identification of Languages. See http://www.ietf.org/rfc/rfc3066.txt
[SCHEMA2]: P.V. Biron, A. Malhotra XML Schema Part 2: Datatypes. W3C Recommendation, May 2001. See http://www.w3.org/TR/xmlschema-2/
[TYPES]: List of media types (MIME types) registered with IANA. See http://www.iana.org/assignments/media-types/index.html
[XML]: World Wide Web Consortium. Extensible Markup Language (XML) 1.0 (Second Edition). W3C Recommendation, 6 October 2000. See http://www.w3.org/TR/2000/REC-xml-20001006
[XML-BASE]: J. Marsh, editor. XML Base. W3C Recommendation, June 2001. See http://www.w3.org/TR/2001/REC-xmlbase-20010627/.
[XMLNS]: World Wide Web Consortium. Namespaces in XML. W3C Recommendation. See http://www.w3.org/TR/REC-xml-names/

5.2 Informative References

[DC]: Dublin Core Metadata Initiative. See http://dublincore.org/
[ISO3166]: Codes for the representation of names of countries. The International Organization for Standardization, 3rd edition, 15 August 1988. See http://www.iso.org/iso/en/prods-services/iso3166ma/index.html.
[JSML]: Sun Microsystems. JSpeech Markup Language. Sun Microsystems submission to W3C, 5 June 2000. See http://www.w3.org/TR/jsml/
[LEX]: World Wide Web Consortium. Pronunciation Lexicon Markup Requirements. W3C Working Draft, 12th March 2001. See http://www.w3.org/TR/lexicon-reqs/
[RDF-SYNTAX]: Ora Lassila and Ralph R. Swick. Resource Description Framework (RDF) Model and Syntax Specification. W3C Recommendation, 22 February 1999. See http://www.w3.org/TR/1999/REC-rdf-syntax-19990222/
[RDF-SCHEMA]: Dan Brickley and R.V. Guha. Resource Description Framework (RDF) Schema Specification 1.0. W3C Candidate Recommendation, March 2000. See http://www.w3.org/TR/2000/CR-rdf-schema-20000327/
[REQS]: World Wide Web Consortium. Speech Synthesis Markup Requirements for Voice Markup Languages. W3C Working Draft. See http://www.w3.org/TR/voice-tts-reqs/
[RFC2616]: R. Fielding, et al., Hypertext Transfer Protocol -- HTTP/1.1. IETF RFC 2616. 1999. See http://www.ietf.org/rfc/rfc2616.txt
[RFC2732]: R. Hinden, B. Carpenter, L. Masinter. Format for Literal IPv6 Addresses in URL's. IETF RFC 2732. 1999. See http://www.ietf.org/rfc/rfc2732.txt
[SABLE]: Richard Sproat, Andrew Hunt, Mari Ostendorf, Paul Taylor, Alan Black, Kevin Lenzo, Mike Edgington, SABLE: A Standard for TTS Markup, International Conference on Spoken Language Processing, 1998.
[SMIL]: World Wide Web Consortium. Synchronized Multimedia Integration Language (SMIL 2.0). W3C Recommendation. See http://www.w3.org/TR/smil20/
[UNICODE]: The Unicode Consortium. The Unicode Standard. See http://www.unicode.org/unicode/standard/versions/
[VXML]: World Wide Web Consortium. Voice Extensible Markup Language (VoiceXML) Version 2.0. W3C Working Draft. See http://www.w3.org/TR/2002/WD-voicexml20-20020424/

6. Acknowledgements

This document was written with the participation of the following members of the W3C Voice Browser Working Group (listed in alphabetical order):

Paolo Baggia, Loquendo
Dan Burnett, Nuance
Jerry Carter, SpeechWorks International
Sasha Caskey, SpeechWorks International
Brian Eberman, SpeechWorks International
Andrew Hunt, SpeechWorks International
Jim Larson, Intel
Bruce Lucas, IBM
Scott McGlashan, PipeBeach
T.V. Raman, IBM
Dave Raggett, W3C/Openwave
Richard Sproat, ATT
Luc Van Tichelen, ScanSoft
Kuansan Wang, Microsoft
Mark Walker, Intel

Appendix A: Example SSML

This appendix is Informative.

The following is an example of reading headers of email messages. The paragraph and sentence elements are used to mark the text structure. The break element is placed before the time and has the effect of marking the time as important information for the listener to pay attention to. The prosody element is used to slow the speaking rate of the email subject so that the user has extra time to listen and write down the details.

<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE speak PUBLIC "-//W3C//DTD SYNTHESIS 1.0//EN"
                  "http://www.w3.org/TR/speech-synthesis/synthesis.dtd">
<speak xmlns="http://www.w3.org/2001/10/synthesis"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                   http://www.w3.org/TR/speech-synthesis/synthesis.xsd"
         xml:lang="en">
  <paragraph>
    <sentence>You have 4 new messages.</sentence>
    <sentence>The first is from Stephanie Williams
      and arrived at <break/> 3:45pm.
    </sentence>
    <sentence>
      The subject is <prosody rate="-20%">ski trip</prosody>
    </sentence>

  </paragraph>
</speak>

The following example combines audio files and different spoken voices to provide information on a collection of music.

<?xml version="1.0" encoding="ISO-8859-1?>
<!DOCTYPE speak PUBLIC "-//W3C//DTD SYNTHESIS 1.0//EN"
                  "http://www.w3.org/TR/speech-synthesis/synthesis.dtd">
<speak xmlns="http://www.w3.org/2001/10/synthesis"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                   http://www.w3.org/TR/speech-synthesis/synthesis.xsd"
         xml:lang="en">

  <paragraph>
    <voice gender="male">
      <sentence>Today we preview the latest romantic music from
        the W3C.</sentence>

      <sentence>Hear what the Software Reviews said about Tim Lee's
        newest hit.</sentence>
    </voice>
  </paragraph>

  <paragraph>
    <voice gender="female">
      He sings about issues that touch us all.
    </voice>
  </paragraph>

  <paragraph>
    <voice gender="male">
      Here's a sample.  <audio src="http://www.w3c.org/music.wav"/>
      Would you like to buy it?
    </voice>
  </paragraph>

</speak>

SMIL Integration Example

The SMIL language [SMIL] is an XML-based multimedia control language. It is especially well suited for describing dynamic media applications that include synthetic speech output.

File 'greetings.ssml' contains the following:

<?xml version="1.0" encoding="ISO-8859-1?>
<!DOCTYPE speak PUBLIC "-//W3C//DTD SYNTHESIS 1.0//EN"
                  "http://www.w3.org/TR/speech-synthesis/synthesis.dtd">

<speak xmlns="http://www.w3.org/2001/10/synthesis"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                   http://www.w3.org/TR/speech-synthesis/synthesis.xsd"
         xml:lang="en">

  <sentence>
   <mark name="greetings"/>
    <emphasis>Greetings</emphasis>
      from the <sub alias="World Wide Web Consortium">W3C</sub>!
  </sentence>
</speak>

SMIL Example 1: W3C logo image appears, and then one second later, the speech sequence is rendered. File 'greetings.smil' contains the following:

<smil xmlns="http://www.w3.org/2001/SMIL20/Language">
  <head>
    <top-layout width="640" height="320">
      <region id="whole" width="640" height="320"/>
    </top-layout>
  </head>
  <body>
    <par>
      <img src="http://w3clogo.gif" region="whole" begin="0s"/>
      <ref src="greetings.ssml" begin="1s"/>
    </par>
  </body>
</smil>

SMIL Example 2: W3C logo image appears, then clicking on the image causes it to disappear and the speech sequence to be rendered. File 'greetings.smil' contains the following:

<smil xmlns="http://www.w3.org/2001/SMIL20/Language">
  <head>
    <top-layout width="640" height="320">
      <region id="whole" width="640" height="320"/>
    </top-layout>
  </head>
  <body>
    <seq>
      <img id="logo" src="http://w3clogo.gif" region="whole"
           begin="0s" end="logo.activateEvent"/>
      <ref src="greetings.ssml"/>
    </seq>
  </body>
</smil>

VoiceXML Integration Example

The following is an example of SSML in VoiceXML (see Section 3.5.3).

<?xml version="1.0" encoding="UTF-8"?> 
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml" 
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
  xsi:schemaLocation="http://www.w3.org/2001/vxml 
   http://www.w3.org/TR/voicexml20/vxml.xsd">
   <form>
      <block>
         <prompt>
           <emphasis>Welcome<emphasis> to the Bird Seed Emporium.
           <audio src="rtsp://www.birdsounds.example.com/thrush.wav"/>
           We have 250 kilogram drums of thistle seed for
           $299.95
           plus shipping and handling this month.
           <audio src="http://www.birdsounds.example.com/mourningdove.wav"/>
         </prompt>
      </block>
   </form>
</vxml>

Appendix B: DTD for the Speech Synthesis Markup Language

This appendix is Informative.

The synthesis DTD is located at http://www.w3.org/TR/speech-synthesis/synthesis.dtd.

Due to DTD limitations, the SSML DTD does not correctly express that the metadata element can contain elements from other XML namespaces.

Appendix C: Schema for the Speech Synthesis Markup Language

This appendix is Normative.

The synthesis schema is located at http://www.w3.org/TR/speech-synthesis/synthesis.xsd.

Note: the synthesis schema includes a no-namespace core schema, located at http://www.w3.org/TR/speech-synthesis/synthesis-core.xsd, which may be used as a basis for specifying Speech Synthesis Markup Language Fragments [Sec. 4.1] embedded in non-synthesis namespace schemas.

Appendix D: Audio File Formats

This appendix is Normative.

SSML requires that a platform support the playing of the audio formats specified below.

Audio Format	Media Type
Raw (headerless) 8kHz 8-bit mono mu-law [PCM] single channel. (G.711)	audio/basic (from [RFC1521])
Raw (headerless) 8kHz 8 bit mono A-law [PCM] single channel. (G.711)	audio/x-alaw-basic
WAV (RIFF header) 8kHz 8-bit mono mu-law [PCM] single channel.	audio/wav
WAV (RIFF header) 8kHz 8-bit mono A-law [PCM] single channel.	audio/wav

The 'audio/basic' mime type is commonly used with the 'au' header format as well as the headerless 8-bit 8Khz mu-law format. If this mime type is specified for recording, the mu-law format must be used. For playback with the 'audio/basic' mime type, processors must support the mu-law format and may support the 'au' format.

Appendix E: MIME Types and File Suffix

This appendix is Informative.

The W3C Voice Browser Working Group has applied to IETF to register a MIME type for the Speech Synthesis Markup Language. The current proposal is to use "application/ssml+xml".

The W3C Voice Browser Working Group has adopted the convention of using the ".ssml" filename suffix for Speech Synthesis Markup Language documents where speak is the root element.

Appendix F: Features Under Consideration for Future Versions

This appendix is Informative.

The following features are under consideration for versions of the Speech Synthesis Markup Language Specification after version 1.0:

Support for other phoneme alphabets.
"Lowlevel" elements (fine-grained acoustic-prosodic control)
Intonational control elements
"Value" element for insertion of expressions
Support for pronunciation lexicon (when it becomes available)

Selecting Voices with a specific accent

Authors may wish to style speech by selecting a voice with a regional accent. For instance, one might wish to use a Scottish accent for speaking some English text. One way to achieve this is to request a named voice known to have the desired accent. The names for such voices are generally vendor specific. Further discussions may lead to the emergence of conventions for naming voices with specific regional accents, and in principle, could result in an extended set of generic voice names for SSML.

A partial work around is to use the xml:lang attribute, which is defined by the XML 1.0 [XML] specification to describe the language in which the document content is written. The values of this attribute are language identifiers as defined by RFC 3066 [RFC3066]. These identifiers can be used to identify country-wide variants of languages, based upon the use of ISO 3166 [ISO3166] country codes. Thus "en-us" denotes US English, while "en-gb" denotes UK English.

This offers a limited means to influence which accent is selected, through the choice of the corresponding ISO 3166 country code. Unfortunately, there is no standard for designating regions within countries, as would be needed for a portable way to request accents such as Scottish or Welsh.

Appendix G: Internationalization

This appendix is Normative.

SSML is an application of XML 1.0 [XML] and thus supports [UNICODE] which defines a standard universal character set.

Additionally, SSML provides a mechanism for precise control of the input and output languages via the use of the xml:lang attribute. This facility provides:

The ability to specify the input and output language overriding the SSML Processor default language
The ability to produce multi-language output
The ability to accept input in a language which is different from the language employed in the spoken output.

Speech Synthesis Markup Language Version 1.0

W3C Working Draft 02 December 2002

Abstract

Status of this Document

0. Table of Contents

1. Introduction

1.1 Vocabulary and Design Concepts

1.2 Speech Synthesis Processes

1.3 Document Generation, Applications and Contexts

1.4 Platform-Dependent Output Behavior of Speech Synthesis Content

1.5 Terminology

2. Elements and Attributes

2.1 Document Structure, Text Processing and Pronunciation

2.1.1 speak Root Element

2.1.2 xml:lang Attribute: Language

2.1.3 paragraph and sentence: Text Structure Elements

2.1.4 say-as Element

The interpret-as and format attributes

The detail attribute

2.1.5 phoneme Element

2.1.6 sub Element

2.2 Prosody and Style

2.2.1 voice Element

2.2.2 emphasis Element

2.2.3 break Element

2.2.4 prosody Element

Number

Relative values

Pitch contour

Limitations

2.3 Other Elements

2.3.1 audio Element

2.3.2 mark Element

2.3.3 desc Element

3. SSML Documents

3.1 Document Form

3.2 Base URI

The xml:base attribute

3.2.1 Resolving Relative URIs

3.3 Pronunciation Lexicon

The lexicon element

3.4 Meta data

3.5 Integration With Other Markup Languages

3.5.1 SMIL

3.5.2 ACSS

3.5.3 VoiceXML

3.6 SSML Document Fetching

4. Conformance

4.1 Conforming Speech Synthesis Markup Language Fragments

4.2 Conforming Stand-Alone Speech Synthesis Markup Language Documents

4.3 Using SSML with other Namespaces

4.4 Conforming Speech Synthesis Markup Language Processors

4.5 Conforming User Agent

5. References

5.1 Normative References

5.2 Informative References

6. Acknowledgements

Appendix A: Example SSML

SMIL Integration Example

VoiceXML Integration Example

Appendix B: DTD for the Speech Synthesis Markup Language

Appendix C: Schema for the Speech Synthesis Markup Language

Appendix D: Audio File Formats

Appendix E: MIME Types and File Suffix

Appendix F: Features Under Consideration for Future Versions

Selecting Voices with a specific accent

Appendix G: Internationalization

2.1.2 `xml:lang` Attribute: Language

The `interpret-as` and `format` attributes

The `detail` attribute