Copyright © 2002 W3C ® ( MIT , INRIA , Keio ), All Rights Reserved. W3C liability , trademark , document use , and software licensing rules apply.
The Voice Browser Working Group has sought to develop standards to enable access to the Web using spoken interaction. The Speech Synthesis Markup Language Specification is part of this set of new markup specifications for voice browsers, and is designed to provide a rich, XML-based markup language for assisting the generation of synthetic speech in Web and other applications. The essential role of the markup language is to provide authors of synthesizable content a standard way to control aspects of speech such as pronunciation, volume, pitch, rate, etc. across different synthesis-capable platforms.
This is a W3C Last Call Working Draft for review by W3C Members and other interested parties. Last Call means that the Working Group believes that this specification is technically sound and therefore wishes this to be the last call for comments. If the feedback is positive, the Working Group plans to submit it for consideration as a W3C Candidate Recommendation. Comments can be sent until 15 January 2003.
Reviewers are encouraged to subscribe to the public discussion list <www-voice@w3.org> and to mail in comments as soon as possible. To subscribe, send an email to <www-voice-request@w3.org> with the word subscribe in the subject line (include the word unsubscribe to unsubscribe). A public archive is available on-line. Following the publication of a previous Last Call Working Draft of this specification, the group received a number of public comments. Those comments have not been addressed in this current document but will be addressed along with any other comments received during the review period for this document. Commenters who have sent their comments to the public mailing list need not resubmit their comments in order for them to be addressed as part of the Last Call review.
This specification describes markup for generating synthetic speech via a speech synthesizer, and forms part of the proposals for the W3C Speech Interface Framework. This document has been produced as part of the W3C Voice Browser Activity, following the procedures set out for the W3C Process. The authors of this document are members of the Voice Browser Working Group (W3C Members only). This is a Royalty Free Working Group, as described in W3C's Current Patent Practice Note. Working Group participants are required to provide patent disclosures.
Although an Implementation Report Plan has not yet been developed for this specification, the Working Group currently expects to require at least two independently developed interoperable implementations of each required feature, and at least one implementation of each optional feature, in order to exit the next phase of this document, the Candidate Recommendation phase. To help the Voice Browser Working Group build such a report, reviewers are encouraged to implement this specification and to indicate to W3C which features have been implemented, and any problems that arose.
Publication as a Working Draft does not imply endorsement by the W3C Membership. This is a draft document and may be updated, replaced or obsoleted by other documents at any time. It is inappropriate to cite W3C Working Drafts as other than "work in progress". A list of current W3C Recommendations and other technical reports can be found at http://www.w3.org/TR/.
The W3C Standard is known as the Speech Synthesis Markup Language specification (SSML) and is based upon the JSGF and/or JSML specifications, which are owned by Sun Microsystems, Inc., California, U.S.A. The JSML specification can be found at [JSML].
SSML is part of a larger set of markup specifications for voice browsers developed through the open processes of the W3C. It is designed to provide a rich, XML-based markup language for assisting the generation of synthetic speech in Web and other applications. The essential role of the markup language is to give authors of synthesizable content a standard way to control aspects of speech output such as pronunciation, volume, pitch, rate, etc. across different synthesis-capable platforms. A related initiative to establish a standard system for marking up text input is SABLE [SABLE].
There is some variance in the use of technical vocabulary in the
speech synthesis community. The following definitions establish a
common understanding for this document.
Voice Browser | A device which interprets a (voice) markup language and is capable of generating voice output and/or interpreting voice input, and possibly other input/output modalities. |
Speech Synthesis | The process of automatic generation of speech output from data input which may include plain text, formatted text or binary objects. |
Text-To-Speech | The process of automatic generation of speech output from text or annotated text input. |
The design and standardization process has followed from the Speech Synthesis Markup Requirements for Voice Markup Languages [REQS].
The following items were the key design criteria.
A Text-To-Speech system (a synthesis processor) that supports SSML will be responsible for rendering a document as spoken output and for using the information contained in the markup to render the document as intended by the author.
Document creation: A text document provided as input to the synthesis processor may be produced automatically, by human authoring, or through a combination of these forms. SSML defines the form of the document.
Document processing: The following are the six major processing steps undertaken by a synthesis processor to convert marked-up text input into automatically generated voice output. The markup language is designed to be sufficiently rich so as to allow control over each of the steps described below so that the document author (human or machine) can control the final voice output.
XML Parse: An XML parser is used to extract the document tree and content from the incoming text document. The structure, tags and attributes obtained in this step influence each of the following steps.
Structure analysis: The structure of a document influences the way in which a document should be read. For example, there are common speaking patterns associated with paragraphs and sentences.
- Markup support: The paragraph and sentence elements defined in SSML explicitly indicate document structures that affect the speech output.
- Non-markup behavior: In documents and parts of documents where these elements are not used, the synthesis processor is responsible for inferring the structure by automated analysis of the text, often using punctuation and other language-specific data.
Text normalization: All written languages have special constructs that require a conversion of the written form (orthographic form) into the spoken form. Text normalization is an automated process of the synthesis processor that performs this conversion. For example, for English, when "$200" appears in a document it may be spoken as "two hundred dollars". Similarly, "1/2" may be spoken as "half", "January second", "February first", "one of two" and so on.
- Markup support: The say-as element can be used in the input document to explicitly indicate the presence and type of these constructs and to resolve ambiguities. The set of constructs that can be marked has not yet been defined but might include dates, times, numbers, acronyms, currency amounts and more.
- Non-markup behavior: For text content that is not marked with the say-as element the synthesis processor is expected to make a reasonable effort to automatically locate and convert these constructs to a speakable form. Because of inherent ambiguities (such as the "1/2" example above) and because of the wide range of possible constructs in any language, this process may introduce errors in the speech output and may cause different processors to render the same document differently.
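For instance, an author could disambiguate the "1/2" construct explicitly, as in the sketch below. The interpret-as and format values shown ("date", "md" and "fraction") are purely illustrative assumptions, since the standard value set has not yet been defined.

<?xml version="1.0" encoding="ISO-8859-1"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
  <!-- Illustrative values only: "date"/"md" and "fraction" are not normative -->
  The party is on <say-as interpret-as="date" format="md">1/2</say-as>,
  and each guest gets <say-as interpret-as="fraction">1/2</say-as> of a cake.
</speak>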
Text-to-phoneme conversion: Once the processor has determined the set of words to be spoken it must convert those words to a string of phonemes. A phoneme is the basic unit of sound in a language. Each language (and sometimes each national or dialect variant of a language) has a specific phoneme set: e.g., most US English dialects have around 45 phonemes. In many languages this conversion is ambiguous since the same written word may have many spoken forms. For example, in English, "read" may be spoken as "reed" (I will read the book) or "red" (I have read the book). Another issue is the handling of words with non-standard spellings or pronunciations. For example, an English synthesis processor will often have trouble determining how to speak some non-English-origin names; e.g. "Tlalpachicatl" which has a Mexican/Aztec origin.
- Markup support: The phoneme element allows a phonemic sequence to be provided for any word or word sequence. This provides the content creator with explicit control over pronunciations. The say-as element might also be used to indicate that text is a proper name, which may allow a synthesis processor to apply special rules to determine a pronunciation. The lexicon element can be used to reference external definitions of pronunciations.
- Non-markup behavior: In the absence of a phoneme element the synthesis processor must apply automated capabilities to determine pronunciations. This is typically achieved by looking up words in a pronunciation dictionary and applying rules to determine other pronunciations. Most processors are expert at performing text-to-phoneme conversions so most words of most documents can be handled automatically.
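As an illustration, the ambiguous word "read" could be pinned to its past-tense pronunciation with the phoneme element. This is only a sketch; the IPA string is shown as literal characters for readability and would typically be entered as character entities.

<?xml version="1.0" encoding="ISO-8859-1"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
  <!-- Force the past-tense pronunciation of "read" -->
  I have already <phoneme alphabet="ipa" ph="rɛd">read</phoneme> the book.
</speak>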
Prosody analysis: Prosody is the set of features of speech output that includes the pitch (also called intonation or melody), the timing (or rhythm), the pausing, the speaking rate, the emphasis on words and many other features. Producing human-like prosody is important for making speech sound natural and for correctly conveying the meaning of spoken language.
- Markup support: The emphasis element, break element and prosody element may all be used by document creators to guide the synthesis processor in generating appropriate prosodic features in the speech output.
- Non-markup behavior: In the absence of these elements, synthesis processors are expert (but not perfect) in automatically generating suitable prosody. This is achieved through analysis of the document structure, sentence syntax, and other information that can be inferred from the text input.
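For example, an author might shape the prosody of a prompt along the following lines; the particular attribute values are arbitrary.

<?xml version="1.0" encoding="ISO-8859-1"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
  Your flight leaves at <emphasis>seven</emphasis> <break time="300ms"/>
  <prosody rate="slow">not eight</prosody> o'clock.
</speak>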
Waveform production: The phonemes and prosodic information are used by the synthesis processor in the production of the audio waveform. There are many approaches to this processing step so there may be considerable platform-specific variation.
- Markup support: SSML does not provide explicit controls over the generation of waveforms. The voice element allows the document creator to request a particular voice or specific voice qualities (e.g. a young male voice). The audio element allows for insertion of recorded audio data into the output stream.
There are many classes of document creator that will produce marked-up documents to be spoken by a synthesis processor. Not all document creators (including human and machine) have access to information that can be used in all of the elements or in each of the processing steps described in the previous section. The following are some of the common cases.
The document creator has no access to information to mark up the text. All processing steps in the synthesis processor must be performed fully automatically on raw text. The document requires only the containing speak element to indicate the content is to be spoken.
When marked-up text is generated programmatically the creator may have specific knowledge of the structure and/or special text constructs in some or all of the document. For example, an email reader can mark the location of the time and date of receipt of email. Such applications may use elements that affect structure, text normalization, prosody and possibly text-to-phoneme conversion.
Some document creators make considerable effort to mark as many details of the document as possible to ensure consistent speech quality across platforms and to more precisely specify output qualities. In these cases, the markup may use any or all of the available elements to tightly control the speech output. For example, prompts generated in telephony and voice browser applications may be fine-tuned to maximize the effectiveness of the overall system.
The most advanced document creators may skip the higher-level markup (structure, text normalization, text-to-phoneme conversion, and prosody analysis) and produce low-level TTS markup for segments of documents or for entire documents. This typically requires tools to generate sequences of phonemes, plus pitch and timing information. For instance, tools that do "copy synthesis" or "prosody transplant" try to emulate human speech by copying properties from recordings.
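A low-level fragment of this kind might combine phonemic content with explicit timing and pitch information, for example as below; the phoneme string, duration and contour values are arbitrary placeholders for values that such tools would compute.

<?xml version="1.0" encoding="ISO-8859-1"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
  <!-- Arbitrary values standing in for timing and pitch copied from a recording -->
  <prosody duration="650ms" contour="(0%,+10Hz)(50%,+25Hz)(100%,-5Hz)">
    <phoneme alphabet="ipa" ph="həloʊ">hello</phoneme>
  </prosody>
</speak>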
The following are important instances of architectures or designs from which marked-up synthesis documents will be generated. The language design is intended to facilitate each of these approaches.
Dialog language: It is a requirement that it should be possible to include documents marked with SSML into the dialog description document to be produced by the Voice Browser Working Group.
Interoperability with Aural CSS : Any HTML processor that is Aural CSS-enabled can produce SSML. ACSS is covered in Section 19 of the Cascading Style Sheets, level 2 (CSS2) Specification [CSS2]. This usage of speech synthesis facilitates improved accessibility to existing HTML and XHTML content.
Application-specific style sheet processing: As mentioned above, there are classes of application that have knowledge of text content to be spoken, and this can be incorporated into the speech synthesis markup to enhance rendering of the document. In many cases, it is expected that the application will use style sheets to perform transformations of existing XML documents to speech synthesis markup. This is equivalent to the use of ACSS with HTML and once again SSML is the "final form" representation to be passed to the synthesis processor. In this context, SSML may be viewed as a superset of ACSS [CSS2, Section 19] capabilities, excepting spatial audio.
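As a sketch of this approach, a style sheet might map a hypothetical application-specific <message> format onto SSML; the source element names (message, from, subject) are assumptions, not part of any specification.

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns="http://www.w3.org/2001/10/synthesis">
  <!-- Transform a hypothetical <message> document into an SSML document -->
  <xsl:template match="/message">
    <speak version="1.0" xml:lang="en-US">
      <paragraph>
        <sentence>Message from <xsl:value-of select="from"/>.</sentence>
        <!-- Slow down the subject so the listener can note it -->
        <sentence>The subject is
          <prosody rate="slow"><xsl:value-of select="subject"/></prosody>.
        </sentence>
      </paragraph>
    </speak>
  </xsl:template>
</xsl:stylesheet>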
SSML provides a standard way to specify gross properties of synthetic speech production such as pronunciation, volume, pitch, rate, etc. Exact specification of synthetic speech output behavior across disparate processors, however, is beyond the scope of this document.
URIs in this specification are of the type 'anyURI' primitive as defined in XML Schema Part 2: Datatypes [SCHEMA2 §3.2.17]. The Schema definition follows [RFC2396] and [RFC2732]. Any relative URI reference must be resolved according to the rules given in Section 3.2.1. In this specification URIs are provided as attributes to elements, for example in the audio and lexicon elements. [See Appendix E for information on media types for SSML.]
The following elements are defined in this specification.
The Speech Synthesis Markup Language is an XML application. The root element is speak. xml:lang is a defined attribute specifying the language of the root document. xml:base is a defined attribute specifying the Base URI of the root document. The version attribute is a required attribute that indicates the version of the specification to be used for the document. The version number for this specification is 1.0.
<?xml version="1.0" encoding="ISO-8859-1?> <speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.w3.org/2001/10/synthesis http://www.w3.org/TR/speech-synthesis/synthesis.xsd" xml:lang="en-US"> ... the body ... </speak>
The following elements can occur within the content of the speak element: audio, break, emphasis, lexicon, mark, metadata, p, paragraph, phoneme, prosody, say-as, sub, s, sentence, voice.
xml:lang Attribute: Language

Following XML 1.0 [XML §2.12], languages are indicated by an xml:lang attribute on the enclosing element with the value being a language identifier.
Language information is inherited down the document hierarchy, i.e. it has to be given only once if the whole document is in one language, and language information nests, i.e. inner attributes overwrite outer attributes.
xml:lang is a defined attribute for the voice, speak, paragraph, sentence, p, and s elements.
<?xml version="1.0" encoding="ISO-8859-1"?> <speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.w3.org/2001/10/synthesis http://www.w3.org/TR/speech-synthesis/synthesis.xsd" xml:lang="en-US"> <paragraph>I don't speak Japanese.</paragraph> <paragraph xml:lang="ja">Nihongo-ga wakarimasen.</paragraph> </speak>
The speech synthesis processor largely determines behavior in the case that a document requires speech output in a language not supported by the processor. Specifying xml:lang does not imply a change in voice, though this may indeed occur. When a given voice is unable to speak content in the indicated language, a new voice may be selected by the processor. No change in the voice or prosody should occur if the xml:lang value is the same as the inherited value. Further information about voice selection appears in Section 2.2.1.
There may be variation across conformant processors in the implementation of xml:lang for different markup elements (e.g. paragraph and sentence elements).
All elements should process their contents specific to the enclosing language. For instance, the phoneme, emphasis, break, paragraph and sentence elements should each be rendered in a manner that is appropriate to the current language.
The text normalization processing step may be affected by the enclosing language. This is true both for markup support by the say-as element and for non-markup behavior. In the following example the same text "2/1/2000" may be read as "February first two thousand" in the first sentence, following American English pronunciation rules, but as "the second of January two thousand" in the second one, which follows Italian preprocessing rules.
<?xml version="1.0" encoding="ISO-8859-1"?> <speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.w3.org/2001/10/synthesis http://www.w3.org/TR/speech-synthesis/synthesis.xsd" xml:lang="en-US"> <sentence>Today, 2/1/2000.</sentence> <!-- Today, February first two thousand --> <sentence xml:lang="it">Un mese fà, 2/1/2000.</sentence> <!-- Un mese fà, il due gennaio duemila --> <!-- One month ago, the second of January two thousand --> </speak>
A paragraph element represents the paragraph structure in text. A sentence element represents the sentence structure in text. For brevity, the markup also supports p and s as exact equivalents of paragraph and sentence.
xml:lang
is a defined attribute on the paragraph, sentence, p and s elements.
<?xml version="1.0" encoding="ISO-8859-1"?> <speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.w3.org/2001/10/synthesis http://www.w3.org/TR/speech-synthesis/synthesis.xsd" xml:lang="en-US"> <paragraph> <sentence>This is the first sentence of the paragraph.</sentence> <sentence>Here's another sentence.</sentence> </paragraph> </speak>
The use of paragraph and sentence elements is optional. Where text occurs without enclosing paragraph or sentence elements, the speech synthesis processor should attempt to determine the structure using language-specific knowledge of the format of plain text.
The following elements can occur within the content of the paragraph or p elements: audio, break, emphasis, mark, phoneme, prosody, say-as, sub, s, sentence, voice.
The following elements can occur within the content of the sentence or s elements: audio, break, emphasis, mark, phoneme, prosody, say-as, sub, voice.
The say-as element allows the author to indicate information on the type of text construct contained within the element and to help specify the level of detail for rendering the contained text.
Defining a comprehensive set of text format types is difficult because of the variety of languages that must be considered and because of the innate flexibility of written languages. SSML only specifies the say-as element, its attributes, and their purpose. It does not enumerate the possible values for the attributes. The Working Group expects to produce a separate document that will define standard values and associated normative behavior for these values. Examples given here are only for illustrating the purpose of the element and the attributes.
The say-as element has three attributes: interpret-as, format, and detail. The interpret-as attribute is always required; the other two attributes are optional. The legal values for the format attribute depend on the value of the interpret-as attribute.
The following elements can occur within the content of the say-as element: none.
interpret-as and format attributes

The interpret-as attribute indicates the content type of the contained text construct. Specifying the content type helps the SSML processor to distinguish and interpret text constructs that may be rendered in different ways depending on what type of information is intended. In addition, the optional format attribute can give further hints on the precise formatting of the contained text for content types that may have ambiguous formats.

When specified, the interpret-as and format values are to be interpreted by the SSML processor as hints provided by the markup document author to aid text normalization and pronunciation.
In all cases, the text enclosed by any say-as element is intended to be a standard, orthographic form of the language currently in context. An SSML processor should be able to support the common, orthographic forms of the specified language for every content type that it supports.
When the value for the interpret-as attribute is unknown or unsupported by a processor, it must render the contained text as if no interpret-as value were specified.

When the value for the format attribute is unknown or unsupported by a processor, it must render the contained text as if no format value were specified, and should render it using the interpret-as value that is specified.
When the content of the element does not match the content type and/or format specified, an SSML processor should proceed and attempt to render the information. When the content of the element contains other text in addition to the indicated content type, the SSML processor must attempt to render such text.
Indicating the content type or format does not necessarily affect the way the information is pronounced. An SSML processor should pronounce the contained text in a manner in which such content is normally produced for the locale.
Example values for the interpret-as and format attributes (please note that these values are just for illustration; they are not suggested or endorsed values):
interpret-as | format | interpretation | Examples |
number | ordinal | interpret the content as an ordinal number | <say-as interpret-as="number" format="ordinal">5</say-as> : fifth |
number | cardinal | interpret the content as a cardinal number | <say-as interpret-as="number" format="cardinal">VII</say-as> : seven |
number | telephone | interpret the content as a telephone number | <say-as interpret-as="number" format="telephone">123-456-7890</say-as> |
date | mdy | interpret the content as a date in month-day-year format | <say-as interpret-as="date" format="mdy">5/12/2003</say-as> : May twelfth, two thousand three |
digits | | interpret the content as digits | <say-as interpret-as="digits">123</say-as> : one two three |
ordinal | | interpret the content as an ordinal number | <say-as interpret-as="ordinal">123</say-as> : one hundred and twenty third |
cardinal | | interpret the content as a cardinal number | <say-as interpret-as="cardinal">123</say-as> : one hundred and twenty three |
letters | | interpret the content as letters | <say-as interpret-as="letters">W3C</say-as> : double-u three cee |
words | | interpret the content as words | <say-as interpret-as="words">ASCII</say-as> : askie |
detail attribute

The detail attribute is an optional attribute that indicates the level of detail to be read aloud or rendered. Every value of the detail attribute must render all of the informational content in the contained text; however, specific values of the detail attribute can be used to render content that is not usually informational in running text but may be important to render for specific purposes. For example, an SSML processor will usually render punctuation through appropriate changes in prosody. Setting a higher level of detail may be used to speak punctuation explicitly, e.g. for reading out coded part numbers or pieces of software code.
The detail attribute can be used for all say-as content types.

If the detail attribute is not specified, the level of detail that is produced by the SSML processor depends on the text content and the locale.
When the value for the detail attribute is unknown or unsupported by a processor, it must render the contained text as if no value were specified for the detail attribute.

Example values for the detail attribute (please note that these values are just for illustration; they are not suggested or endorsed values):
interpret-as | format | detail | interpretation | Examples |
 | | dictate | dictate the text | <say-as interpret-as="" detail="dictate">It's simple, isn't it?</say-as> : It's simple comma isn't it question mark |
letters | | strict | speak letters with all detail | <say-as interpret-as="letters" detail="strict">X4:5à-bB2</say-as> : capital X four colon five A with grave accent dash B capital B two |
number | telephone | punctuation | speak the punctuation marks given in the telephone number | <say-as interpret-as="number" format="telephone" detail="punctuation">09/123.45.67</say-as> : zero nine slash one hundred twenty-three dot forty-five dot sixty-seven |
The phoneme element provides a phonetic pronunciation for the contained text. The phoneme element may be empty. However, it is recommended that the element contain human-readable text that can be used for non-spoken rendering of the document. For example, the content may be displayed visually for users with hearing impairments.
The ph attribute is a required attribute that specifies the phoneme string.

The alphabet attribute is an optional attribute that specifies the phonetic alphabet. SSML processors should support a value for alphabet of "ipa", corresponding to characters composing the International Phonetic Alphabet [IPA]. In addition to an exhaustive set of vowel and consonant symbols, IPA supports a syllable delimiter, numerous diacritics, stress symbols, lexical tone symbols, intonational markers and more.
<?xml version="1.0" encoding="ISO-8859-1"?> <speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.w3.org/2001/10/synthesis http://www.w3.org/TR/speech-synthesis/synthesis.xsd" xml:lang="en-US"> <phoneme alphabet="ipa" ph="tɒmɑtoʊ"> tomato </phoneme> <!-- This is an example of IPA using character entities --> </speak>
It is an error if a value for alphabet is specified that is not known or cannot be applied by an SSML processor.
The following elements can occur within the content of the phoneme element: none.
The sub element is employed to indicate that the specified text replaces the contained text for pronunciation. This allows a document to contain both a spoken and written form. The required alias attribute specifies the string to be substituted for the enclosed string. The processor should apply text normalization to the alias value.
The following elements can occur within the content of the sub element: none.
<?xml version="1.0" encoding="ISO-8859-1"?> <speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.w3.org/2001/10/synthesis http://www.w3.org/TR/speech-synthesis/synthesis.xsd" xml:lang="en-US"> <sub alias="World Wide Web Consortium"> W3C </sub> <!-- World Wide Web Consortium --> </speak>
The voice element is a production element that requests a change in speaking voice. Attributes are:
- xml:lang: optional language specification attribute.
- gender: optional attribute indicating the preferred gender of the voice to speak the contained text. Enumerated values are: "male", "female", "neutral".
- age: optional attribute indicating the preferred age of the voice to speak the contained text. Acceptable values are integers.
- variant: optional attribute indicating a preferred variant of the other voice characteristics to speak the contained text (e.g. the second or next male child voice). Valid values of variant are integers.
- name: optional attribute indicating a platform-specific voice name to speak the contained text. The value may be a space-separated list of names ordered from top preference down. As a result a name must not contain any whitespace.
<?xml version="1.0" encoding="ISO-8859-1"?> <speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.w3.org/2001/10/synthesis http://www.w3.org/TR/speech-synthesis/synthesis.xsd" xml:lang="en-US"> <voice gender="female">Mary had a little lamb,</voice> <!-- now request a different female child's voice --> <voice gender="female" variant="2"> It's fleece was white as snow. </voice> <!-- platform-specific voice selection --> <voice name="Mike">I want to be like Mike.</voice> </speak>
When there is not a voice available that exactly matches the attributes specified in the document, or when there are multiple voices that match the criteria, the voice selection algorithm may be processor-specific.

If a voice is available for the requested xml:lang, an SSML processor must use it. If there are multiple such voices available, the processor should use the voice that best matches the specified values for name, variant, gender and age.
If there is no voice available for the requested language, the processor should select a voice that is closest to the requested language (e.g. same language but different region). If there are multiple such voices available, the processor should use a voice that best matches the specified values for name, variant, gender and age.

It is an error if the processor decides it does not have a voice that sufficiently matches the above criteria.

Note: The group is considering adding more explicit control over voice selection in a future version of the SSML specification.
voice attributes are inherited down the tree, including into elements that change the language.
<?xml version="1.0" encoding="ISO-8859-1"?> <speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.w3.org/2001/10/synthesis http://www.w3.org/TR/speech-synthesis/synthesis.xsd" xml:lang="en-US"> <voice gender="female"> Any female voice here. <voice age="6"> A female child voice here. <paragraph xml:lang="ja"> <!-- A female child voice in Japanese. --> </paragraph> </voice> </voice> </speak>
A change in voice resets the prosodic parameters since different voices have different natural pitch and speaking rates. Volume is the only exception.
The quality of the output audio or voice may suffer if a change in voice is requested within a sentence.
The following elements can occur within the content of the voice element: audio, break, emphasis, mark, p, paragraph, phoneme, prosody, say-as, sub, s, sentence, voice.
The emphasis element requests that the contained text be spoken with emphasis (also referred to as prominence or stress). The synthesis processor determines how to render emphasis since the nature of emphasis differs between languages, dialects or even voices. The attributes are:
- level: the optional level attribute indicates the strength of emphasis to be applied. Defined values are "strong", "moderate", "none" and "reduced". The default level is "moderate". The meaning of "strong" and "moderate" emphasis is interpreted according to the language being spoken (languages indicate emphasis using a possible combination of pitch change, timing changes, loudness and other acoustic differences). The "reduced" level is effectively the opposite of emphasizing a word. For example, when the phrase "going to" is reduced it may be spoken as "gonna". The "none" level is used to prevent the speech synthesis processor from emphasizing words that it might typically emphasize.
<?xml version="1.0" encoding="ISO-8859-1"?> <speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.w3.org/2001/10/synthesis http://www.w3.org/TR/speech-synthesis/synthesis.xsd" xml:lang="en-US"> That is a <emphasis> big </emphasis> car! That is a <emphasis level="strong"> huge </emphasis> bank account! </speak>
The following elements can occur within the content of the emphasis element: audio, break, emphasis, mark, phoneme, prosody, say-as, sub, voice.
The break element is an empty element that controls the pausing or other prosodic boundaries between words. The use of the break element between any pair of words is optional. If the element is not present between words, the speech synthesis processor is expected to automatically determine a break based on the linguistic context. In practice, the break element is most often used to override the typical automatic behavior of a synthesis processor. The attribute is:
- time: the time attribute is an optional attribute indicating the duration of a pause. Legal values are: durations in seconds or milliseconds, "none", "x-small", "small", "medium" (default value), "large", or "x-large". Durations follow the "Times" attribute format from the [CSS2] specification, e.g. "250ms", "3s". The value "none" indicates that a normal break boundary should be used. The other five values indicate increasingly larger break boundaries between words. The larger boundaries are typically accompanied by pauses.
<?xml version="1.0" encoding="ISO-8859-1"?> <speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.w3.org/2001/10/synthesis http://www.w3.org/TR/speech-synthesis/synthesis.xsd" xml:lang="en-US"> Take a deep breath <break/> then continue. Press 1 or wait for the tone. <break time="3s"/> I didn't hear you! </speak>
The following elements can occur within the content of the break element: none.
The prosody element permits control of the pitch, speaking rate and volume of the speech output. The attributes, all optional, are:
- pitch: the baseline pitch for the contained text. Legal values are: a number followed by "Hz", a relative change, or "x-high", "high", "medium", "low", "x-low", or "default".
- contour: sets the actual pitch contour for the contained text. The format is specified in Pitch contour below.
- range: the pitch range (variability) for the contained text. Legal values are: a number followed by "Hz", a relative change, or "x-high", "high", "medium", "low", "x-low", or "default".
- rate: the speaking rate in words-per-minute for the contained text. Legal values are: a number, a relative change, or "x-fast", "fast", "medium", "slow", "x-slow", or "default".
- duration: a value in seconds or milliseconds for the desired time to take to read the element contents. Follows the "Times" attribute format from the [CSS2] specification, e.g. "250ms", "3s".
- volume: the volume for the contained text in the range 0.0 to 100.0 (higher values are louder and specifying a value of zero is equivalent to specifying "silent"). Legal values are: a number, a relative change, or "silent", "x-soft", "soft", "medium", "loud", "x-loud", or "default". The volume scale is linear amplitude. The default is 100.0.
Relative changes for the attributes above can be specified as follows:

- For the rate and volume attributes, relative changes are a number preceded by "+" or "-", e.g. "+10", "-5.5".
- For the pitch and range attributes, relative changes can be given in semitones (a number preceded by "+" or "-" and followed by "st") or in Hertz (a number preceded by "+" or "-" and followed by "Hz"): "+0.5st", "+5st", "-2st", "+10Hz", "-5.5Hz".

<?xml version="1.0" encoding="ISO-8859-1"?> <speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.w3.org/2001/10/synthesis http://www.w3.org/TR/speech-synthesis/synthesis.xsd" xml:lang="en-US"> The price of XYZ is <prosody rate="-10%">$45</prosody> </speak>
The pitch contour is defined as a set of targets at specified time positions in the speech output. The algorithm for interpolating between the targets is processor-specific. In each pair of the form (time position, target), the first value is a percentage of the period of the contained text (a number followed by "%") and the second value is the value of the pitch attribute (a number followed by "Hz", a relative change, or a descriptive value; all are permitted). Time position values outside 0% to 100% are ignored. If a pitch value is not defined for 0% or 100% then the nearest pitch target is copied. All relative values for the pitch are relative to the pitch value just before the contained text.
<?xml version="1.0" encoding="ISO-8859-1"?> <speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.w3.org/2001/10/synthesis http://www.w3.org/TR/speech-synthesis/synthesis.xsd" xml:lang="en-US"> <prosody contour="(0%,+20)(10%,+30%)(40%,+10)"> good morning </prosody> </speak>
The duration attribute takes precedence over the rate attribute. The contour attribute takes precedence over the pitch and range attributes.

The default value of all prosodic attributes is no change. For example, omitting the rate attribute means that the rate is the same within the element as outside.
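For instance, in the following sketch the 5-second duration governs the timing of the contained text even though a rate is also given; the values are arbitrary.

<?xml version="1.0" encoding="ISO-8859-1"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
  <!-- duration takes precedence over rate -->
  <prosody rate="fast" duration="5s">
    This sentence is stretched or compressed to take five seconds.
  </prosody>
</speak>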
The following elements can occur within the content of the prosody element: audio, break, emphasis, mark, p, paragraph, phoneme, prosody, say-as, sub, s, sentence, voice.
All prosodic attribute values are indicative. If a speech synthesis processor is unable to accurately render a document as specified (e.g. trying to set the pitch to 1 MHz, or the speaking rate to 1,000,000 words per minute), it must make a best effort to continue processing by imposing a limit on, or a substitute for, the specified, unsupported value, and may inform the host environment when such limits are exceeded.
In some cases, SSML processors may elect to ignore a given prosodic markup if the processor determines, for example, that the indicated value is redundant, improper or in error. In particular, concatenative-type synthetic speech systems that employ large acoustic units may reject prosody-modifying markup elements if they are redundant with the prosody of a given acoustic unit(s) or would otherwise result in degraded speech quality.
The audio element supports the insertion of recorded audio files (see Appendix D for required formats) and the insertion of other audio formats in conjunction with synthesized speech output. The audio element may be empty. If the audio element is not empty then the contents should be the marked-up text to be spoken if the audio document is not available. The alternate content may include text, speech markup, desc elements, or other audio elements. The alternate content may also be used when rendering the document to non-audible output and for accessibility (see the desc element). The required attribute is src, which is the URI of a document with an appropriate MIME type.
<?xml version="1.0" encoding="ISO-8859-1"?> <speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.w3.org/2001/10/synthesis http://www.w3.org/TR/speech-synthesis/synthesis.xsd" xml:lang="en-US"> <!-- Empty element --> Please say your name after the tone. <audio src="beep.wav"/> <!-- Container element with alternative text --> <audio src="prompt.au">What city do you want to fly from?</audio> <audio src="welcome.wav"> <emphasis>Welcome</emphasis> to the Voice Portal. </audio> </speak>
An audio element is successfully rendered either when the referenced audio source is played, or when the audio source cannot be played (or audible output is not being produced) and the alternative content is successfully rendered.
Deciding which conditions result in the alternative content being rendered is processor-dependent. If the audio element is not successfully rendered, an SSML processor should continue processing and should notify the hosting environment. A processor may determine after beginning playback of an audio source that the audio cannot be played in its entirety. For example, encoding problems, network disruptions, etc. may occur. The processor may designate this either as successful or unsuccessful rendering, but it must document this behavior.
The following elements can occur within the content of the audio element: audio, break, desc, emphasis, mark, p, paragraph, phoneme, prosody, say-as, sub, s, sentence, voice.
A mark element is an empty element that places a marker into the text/tag sequence. The mark element can be used to reference a specific location in the text/tag sequence, and can additionally be used to insert a marker into an output stream for asynchronous notification. When processing a mark element, an SSML processor must do one or both of the following:
- Inform the hosting environment with the value of the name attribute and with information allowing the platform to retrieve the corresponding position in the rendered output.
- When audio output of the SSML document reaches the mark, issue an event that includes the name attribute of the element. The hosting environment defines the destination of the event.

The mark element does not affect the speech output process.
<?xml version="1.0" encoding="ISO-8859-1"?> <speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.w3.org/2001/10/synthesis http://www.w3.org/TR/speech-synthesis/synthesis.xsd" xml:lang="en-US"> Go from <mark name="here"/> here, to <mark name="there"/> there! </speak>
The following elements can occur within the content of the mark element: none.
The desc element can only occur within the content of the audio element. When the audio source referenced in audio is not speech (e.g. audio wallpaper or sonicon punctuation), the audio element should contain a desc element whose textual content is a description of the audio source (e.g. "door slamming"). If text-only output is being produced by the SSML processor, the content of the desc element(s) should be rendered instead of other alternative content in audio.
The following elements can occur within the content of the desc element: none.
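A minimal sketch of desc usage follows; the audio file name is hypothetical.

<?xml version="1.0" encoding="ISO-8859-1"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
  Welcome back.
  <!-- Hypothetical non-speech audio with a textual description -->
  <audio src="doorbell.wav">
    <desc>door bell ringing</desc>
    The door bell rings.
  </audio>
</speak>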
A legal stand-alone Speech Synthesis Markup Language document must have a legal XML Prolog [XML §2.8].
The XML prolog in a synthesis document comprises the XML declaration and an optional DOCTYPE declaration referencing the synthesis DTD. It is followed by the root speak element. The XML prolog may also contain XML comments, processing instructions and other content permitted by XML in a prolog.
The version number of the XML declaration indicates which version of XML is being used. The version number of the speak element indicates which version of the SSML specification is being used; for this specification the value is "1.0". The speak version attribute is required.
The speak element must designate the SSML namespace using the xmlns attribute [XMLNS]. The namespace for SSML is defined to be http://www.w3.org/2001/10/synthesis. It is recommended that the speak element also include xmlns:xsi and xsi:schemaLocation attributes to indicate the location of the SSML schema (see Appendix C):
If present, the optional DOCTYPE must reference the standard DOCTYPE and identifier.
The following are two examples of legal SSML headers:
<?xml version="1.0" encoding="ISO-8859-1"?> <speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.w3.org/2001/10/synthesis http://www.w3.org/TR/speech-synthesis/synthesis.xsd" xml:lang="en">
<?xml version="1.0" encoding="ISO-8859-1"?> <!DOCTYPE speak PUBLIC "-//W3C//DTD SYNTHESIS 1.0//EN" "http://www.w3.org/TR/speech-synthesis/synthesis.dtd"> <speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en">
The language for the document is defined by the xml:lang attribute on the speak element. See Section 2.1.2 for details.

The base URI for the document is defined by the xml:base attribute on the speak element. See Section 3.2 for details.
The metadata and lexicon elements must occur before all other elements and text contained within the root speak element. There are no other ordering constraints on the elements in this specification.
Relative URIs are resolved according to a base URI, which may come from a variety of sources. The base URI declaration allows authors to specify a document's base URI explicitly. See Section 3.2.1 for details on the resolution of relative URIs.
The path information specified by the base URI declaration only affects URIs in the document where the element appears.
The base URI declaration is permitted but optional. The two elements affected by it are:

- audio: the optional src attribute can specify a relative URI.
- lexicon: the uri attribute can specify a relative URI.
The base URI declaration follows [XML-BASE] and is indicated by an xml:base attribute on the root speak element.
<?xml version="1.0" encoding="ISO-8859-1"?> <speak version="1.0" xml:lang="en-US" xmlns="http://www.w3.org/2001/10/synthesis" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.w3.org/2001/10/synthesis http://www.w3.org/TR/speech-synthesis/synthesis.xsd" xml:base="http://www.example.com/base-file-path">
<?xml version="1.0" encoding="ISO-8859-1"?> <speak version="1.0" xml:lang="en-US" xmlns="http://www.w3.org/2001/10/synthesis" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.w3.org/2001/10/synthesis http://www.w3.org/TR/speech-synthesis/synthesis.xsd" xml:base="http://www.example.com/another-base-file-path">
User agents must calculate the base URI for resolving relative URIs according to [RFC2396]. The following describes how [RFC2396] applies to synthesis documents.
User agents must calculate the base URI according to the following precedences (highest priority to lowest):

- the xml:base attribute on the speak element (see Section 3.2).

A synthesis document may reference one or more external pronunciation lexicon documents. A lexicon document is identified by a URI with an optional media type.
The pronunciation information contained within a lexicon document is used only for words defined within the enclosing document.
The W3C Voice Browser Working Group is developing the Pronunciation Lexicon Markup Language [LEX]. The specification will address the matching process between words and lexicon entries and the mechanism by which a speech synthesis processor handles multiple pronunciations from internal and synthesis-specified lexicons. Pronunciation handling with proprietary lexicon formats will necessarily be specific to the synthesis processor.
Pronunciation lexicons are necessarily language-specific. Pronunciation lookup in a lexicon and pronunciation inference for any word may use an algorithm that is language-specific.
Any number of lexicon elements may occur as immediate children of the speak element. The lexicon element must have a uri attribute specifying a URI that identifies the location of the pronunciation lexicon document.
Issue: There has been some discussion as to whether the lexicon element should be permitted to occur within the content of elements other than speak. Reviewers are especially encouraged to provide feedback on this point.
The lexicon element may have a type attribute that specifies the media type of the pronunciation lexicon document.
<?xml version="1.0" encoding="ISO-8859-1"?> <!DOCTYPE speak PUBLIC "-//W3C//DTD SYNTHESIS 1.0//EN" "http://www.w3.org/TR/speech-synthesis/synthesis.dtd"> <speak xmlns="http://www.w3.org/2001/10/synthesis" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.w3.org/2001/10/synthesis http://www.w3.org/TR/speech-synthesis/synthesis.xsd" xml:lang="en"> <lexicon uri="http://www.example.com/lexicon.file"/> <lexicon uri="http://www.example.com/strange-words.file" type="media-type"/> ... </speak>
The following elements can occur within the content of the lexicon element: none.
The metadata element is a container in which information about the document can be placed using a metadata schema. Although any metadata schema can be used with metadata, it is recommended that the Resource Description Framework (RDF) schema [RDF-SCHEMA] be used in conjunction with the general metadata properties defined in the Dublin Core Metadata Initiative [DC].
RDF is a declarative language and provides a standard way for using XML to represent metadata in the form of statements about properties and relationships of items on the Web. Content creators should refer to W3C metadata Recommendations [RDF-SYNTAX] and [RDF-SCHEMA] when deciding which metadata RDF schema to use in their documents. Content creators should also refer to the Dublin Core Metadata Initiative [DC], which is a set of generally applicable core metadata properties (e.g., Title, Creator, Subject, Description, Copyrights, etc.).
Document properties declared with the metadata element can use any metadata schema.
Informative: This is an example of how metadata can be included in a speech synthesis document using the Dublin Core version 1.0 RDF schema [DC] describing general document information such as title, description, date, and so on:
<?xml version="1.0" encoding="ISO-8859-1"?> <!DOCTYPE speak PUBLIC "-//W3C//DTD SYNTHESIS 1.0//EN" "http://www.w3.org/TR/speech-synthesis/synthesis.dtd"> <speak xmlns="http://www.w3.org/2001/10/synthesis" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.w3.org/2001/10/synthesis http://www.w3.org/TR/speech-synthesis/synthesis.xsd" xml:lang="en"> <metadata> <rdf:RDF xmlns:rdf = "http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:rdfs = "http://www.w3.org/TR/1999/PR-rdf-schema-19990303#" xmlns:dc = "http://purl.org/metadata/dublin_core#"> <!-- Metadata about the synthesis document --> <rdf:Description about="http://www.example.com/meta.ssml" dc:Title="Hamlet-like Soliloquy" dc:Description="Aldine's Soliloquy in the style of Hamlet" dc:Publisher="W3C" dc:Language="en" dc:Date="2002-11-29" dc:Rights="Copyright 2002 Aldine Turnbet" dc:Format="application/ssml+xml" > <dc:Creator> <rdf:Seq ID="CreatorsAlphabeticalBySurname"> <rdf:li>William Shakespeare</rdf:li> <rdf:li>Aldine Turnbet</rdf:li> </rdf:Seq> </dc:Creator> </rdf:Description> </rdf:RDF> </metadata> </speak>
The following SSML elements can occur within the content of the metadata element: none.
The Synchronized Multimedia Integration Language (SMIL, pronounced "smile") [SMIL] enables simple authoring of interactive audiovisual presentations. SMIL is typically used for "rich media"/multimedia presentations which integrate streaming audio and video with images, text or any other media type. SMIL is an easy-to-learn HTML-like language, and many SMIL presentations are written using a simple text-editor. See the SMIL/SSML integration examples in Appendix A.
Aural style sheets [CSS2, Section 19] are employed to augment standard visual forms of documents (like HTML) with additional elements that assist in the synthesis of the text into audio. In comparison to SSML, ACSS-generated documents are capable of more complex specifications of the audio sequence, including the designation of 3D location of the audio source. Many of the other ACSS elements overlap SSML functionality, especially in the specification of voice type/quality. SSML may be viewed as a superset of ACSS capabilities, excepting spatial audio.
The Voice Extensible Markup Language [VoiceXML] enables Web-based development and content-delivery for interactive voice response applications. VoiceXML supports speech synthesis, recording and playback of digitized audio, speech recognition, DTMF input, telephony call control, and form-driven mixed initiative dialogs. VoiceXML 2.0 extends SSML for the markup of text to be synthesized. For an example of the integration between VoiceXML and SSML see [Appendix A].
The fetching and caching behavior of SSML documents is defined by the environment in which the SSML processor operates. In a VoiceXML interpreter context for example, the caching policy is determined by the VoiceXML interpreter.
A synthesis document fragment is a Conforming Speech Synthesis Markup Language Fragment if:

- after all elements and attributes from non-synthesis namespaces and all xmlns attributes which refer to non-synthesis namespace elements are removed from the document,
- after an XML declaration (<?xml...?>) is included at the top of the document,
- and, if the root element does not already designate the synthesis namespace with an xmlns attribute, after xmlns="http://www.w3.org/2001/10/synthesis" is added to the element,

the resulting document is a Conforming Stand-Alone Speech Synthesis Markup Language Document.
to the element.A document is a Conforming Stand-Alone Speech Synthesis Markup Language Document if it meets both the following conditions:
The SSML specification and these conformance criteria provide no designated size limits on any aspect of synthesis documents. There are no maximum values on the number of elements, the amount of character data, or the number of characters in attribute values.
The synthesis namespace may be used with other XML namespaces as per the Namespaces in XML Recommendation [XMLNS]. Future work by W3C will address ways to specify conformance for documents involving multiple namespaces.
A Speech Synthesis Markup Language processor is a program that can parse and process Speech Synthesis Markup Language documents.
In a Conforming Speech Synthesis Markup Language Processor, the XML parser must be able to parse and process all XML constructs defined by XML 1.0 [XML] and Namespaces in XML [XMLNS]. This XML parser is not required to perform validation of a Speech Synthesis Markup Language document as per its schema or DTD; this implies that during processing of a Speech Synthesis Markup Language document it is optional to apply or expand external entity references defined in an external DTD.
A Conforming Speech Synthesis Markup Language Processor must correctly understand and apply the semantics of each markup element as described by this document.
A Conforming Speech Synthesis Markup Language Processor must meet the following requirements for handling of languages:
When a Conforming Speech Synthesis Markup Language Processor encounters elements or attributes in a non-synthesis namespace it may:
There is, however, no conformance requirement with respect to performance characteristics of the Speech Synthesis Markup Language Processor. For instance, no statement is required regarding the accuracy, speed or other characteristics of speech produced by the processor. No statement is made regarding the size of input that a Speech Synthesis Markup Language Processor must support.
A Conforming User Agent is a Conforming Speech Synthesis Markup Language Processor that is capable of accepting a Speech Synthesis Markup Language document as input and producing a spoken output by using the information contained in the markup to render the document as intended by the author.
Since the output cannot be guaranteed to be a correct representation of all the markup contained in the input there is no conformance requirement regarding accuracy. A conformance test may, however, require some examples of correct synthesis of a reference document to determine conformance.
This document was written with the participation of the following members of the W3C Voice Browser Working Group (listed in alphabetical order):
This appendix is Informative.
The following is an example of reading headers of email messages. The paragraph and sentence elements are used to mark the text structure. The break element is placed before the time and has the effect of marking the time as important information for the listener to pay attention to. The prosody element is used to slow the speaking rate of the email subject so that the user has extra time to listen and write down the details.
<?xml version="1.0" encoding="ISO-8859-1"?> <!DOCTYPE speak PUBLIC "-//W3C//DTD SYNTHESIS 1.0//EN" "http://www.w3.org/TR/speech-synthesis/synthesis.dtd"> <speak xmlns="http://www.w3.org/2001/10/synthesis" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.w3.org/2001/10/synthesis http://www.w3.org/TR/speech-synthesis/synthesis.xsd" xml:lang="en"> <paragraph> <sentence>You have 4 new messages.</sentence> <sentence>The first is from Stephanie Williams and arrived at <break/> 3:45pm. </sentence> <sentence> The subject is <prosody rate="-20%">ski trip</prosody> </sentence> </paragraph> </speak>
The following example combines audio files and different spoken voices to provide information on a collection of music.
<?xml version="1.0" encoding="ISO-8859-1?> <!DOCTYPE speak PUBLIC "-//W3C//DTD SYNTHESIS 1.0//EN" "http://www.w3.org/TR/speech-synthesis/synthesis.dtd"> <speak xmlns="http://www.w3.org/2001/10/synthesis" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.w3.org/2001/10/synthesis http://www.w3.org/TR/speech-synthesis/synthesis.xsd" xml:lang="en"> <paragraph> <voice gender="male"> <sentence>Today we preview the latest romantic music from the W3C.</sentence> <sentence>Hear what the Software Reviews said about Tim Lee's newest hit.</sentence> </voice> </paragraph> <paragraph> <voice gender="female"> He sings about issues that touch us all. </voice> </paragraph> <paragraph> <voice gender="male"> Here's a sample. <audio src="http://www.w3c.org/music.wav"/> Would you like to buy it? </voice> </paragraph> </speak>
The SMIL language [SMIL] is an XML-based multimedia control language. It is especially well suited for describing dynamic media applications that include synthetic speech output.
File 'greetings.ssml' contains the following:
<?xml version="1.0" encoding="ISO-8859-1?> <!DOCTYPE speak PUBLIC "-//W3C//DTD SYNTHESIS 1.0//EN" "http://www.w3.org/TR/speech-synthesis/synthesis.dtd"> <speak xmlns="http://www.w3.org/2001/10/synthesis" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.w3.org/2001/10/synthesis http://www.w3.org/TR/speech-synthesis/synthesis.xsd" xml:lang="en"> <sentence> <mark name="greetings"/> <emphasis>Greetings</emphasis> from the <sub alias="World Wide Web Consortium">W3C</sub>! </sentence> </speak>
SMIL Example 1: W3C logo image appears, and then one second later, the speech sequence is rendered. File 'greetings.smil' contains the following:
<smil xmlns="http://www.w3.org/2001/SMIL20/Language">
  <head>
    <top-layout width="640" height="320">
      <region id="whole" width="640" height="320"/>
    </top-layout>
  </head>
  <body>
    <par>
      <img src="http://w3clogo.gif" region="whole" begin="0s"/>
      <ref src="greetings.ssml" begin="1s"/>
    </par>
  </body>
</smil>
SMIL Example 2: W3C logo image appears, then clicking on the image causes it to disappear and the speech sequence to be rendered. File 'greetings.smil' contains the following:
<smil xmlns="http://www.w3.org/2001/SMIL20/Language">
  <head>
    <top-layout width="640" height="320">
      <region id="whole" width="640" height="320"/>
    </top-layout>
  </head>
  <body>
    <seq>
      <img id="logo" src="http://w3clogo.gif" region="whole"
           begin="0s" end="logo.activateEvent"/>
      <ref src="greetings.ssml"/>
    </seq>
  </body>
</smil>
<?xml version="1.0" encoding="UTF-8"?> <vxml version="2.0" xmlns="http://www.w3.org/2001/vxml" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.w3.org/2001/vxml http://www.w3.org/TR/voicexml20/vxml.xsd"> <form> <block> <prompt> <emphasis>Welcome<emphasis> to the Bird Seed Emporium. <audio src="rtsp://www.birdsounds.example.com/thrush.wav"/> We have 250 kilogram drums of thistle seed for $299.95 plus shipping and handling this month. <audio src="http://www.birdsounds.example.com/mourningdove.wav"/> </prompt> </block> </form> </vxml>
This appendix is Informative.
The synthesis DTD is located at http://www.w3.org/TR/speech-synthesis/synthesis.dtd.
Due to DTD limitations, the SSML DTD does not correctly express that the metadata element can contain elements from other XML namespaces.
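As an informal sketch of the kind of foreign-namespace content the metadata element is intended to hold, the fragment below embeds RDF with Dublin Core properties; the namespaces shown are real, but the property values and document text are invented for illustration.

<?xml version="1.0" encoding="UTF-8"?>
<speak xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en">
  <metadata>
    <!-- Foreign-namespace (RDF / Dublin Core) content; the values
         are hypothetical and serve only to illustrate the point. -->
    <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
             xmlns:dc="http://purl.org/dc/elements/1.1/">
      <rdf:Description rdf:about=""
                       dc:title="Sample synthesis document"
                       dc:creator="An example author"/>
    </rdf:RDF>
  </metadata>
  <sentence>This document carries descriptive metadata.</sentence>
</speak>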
This appendix is Normative.
The synthesis schema is located at http://www.w3.org/TR/speech-synthesis/synthesis.xsd.
Note: the synthesis schema includes a no-namespace core schema, located at http://www.w3.org/TR/speech-synthesis/synthesis-core.xsd, which may be used as a basis for specifying Speech Synthesis Markup Language Fragments [Sec. 4.1] embedded in non-synthesis namespace schemas.
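The sketch below, which uses an invented host-language namespace, shows one way a non-synthesis schema might pull in the no-namespace core schema with xsd:include, so that the included component definitions take on the host schema's target namespace.

<?xml version="1.0" encoding="UTF-8"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
            targetNamespace="http://www.example.org/host-language"
            xmlns="http://www.example.org/host-language"
            elementFormDefault="qualified">

  <!-- Including a no-namespace schema coerces its declarations into
       this schema's target namespace ("chameleon" inclusion). -->
  <xsd:include
      schemaLocation="http://www.w3.org/TR/speech-synthesis/synthesis-core.xsd"/>

  <!-- Host-language declarations that reference the included
       Speech Synthesis Markup Language fragment components
       would follow here. -->

</xsd:schema>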
This appendix is Normative.
SSML requires that a platform support the playing of the audio formats specified below.
Audio Format | Media Type
--- | ---
Raw (headerless) 8kHz 8-bit mono mu-law [PCM] single channel. (G.711) | audio/basic (from [RFC1521])
Raw (headerless) 8kHz 8-bit mono A-law [PCM] single channel. (G.711) | audio/x-alaw-basic
WAV (RIFF header) 8kHz 8-bit mono mu-law [PCM] single channel. | audio/wav
WAV (RIFF header) 8kHz 8-bit mono A-law [PCM] single channel. | audio/wav
The 'audio/basic' MIME type is commonly used with the 'au' header format as well as the headerless 8-bit 8kHz mu-law format. If this MIME type is specified for recording, the mu-law format must be used. For playback with the 'audio/basic' MIME type, processors must support the mu-law format and may support the 'au' format.
This appendix is Informative.
The W3C Voice Browser Working Group has applied to IETF to register a MIME type for the Speech Synthesis Markup Language. The current proposal is to use "application/ssml+xml".
The W3C Voice Browser Working Group has adopted the convention of using the ".ssml" filename suffix for Speech Synthesis Markup Language documents where speak is the root element.
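For instance, a minimal standalone document of the kind that might be saved with this suffix, say as "hello.ssml" (the filename is illustrative), looks like this:

<?xml version="1.0" encoding="UTF-8"?>
<speak xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en">
  <sentence>Hello from the Speech Synthesis Markup Language.</sentence>
</speak>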
This appendix is Informative.
The following features are under consideration for versions of the Speech Synthesis Markup Language Specification after version 1.0:
Authors may wish to style speech by selecting a voice with a regional accent. For instance, one might wish to use a Scottish accent for speaking some English text. One way to achieve this is to request a named voice known to have the desired accent. The names for such voices are generally vendor specific. Further discussions may lead to the emergence of conventions for naming voices with specific regional accents and, in principle, could result in an extended set of generic voice names for SSML.
A partial workaround is to use the xml:lang attribute, which is defined by the XML 1.0 [XML] specification to describe the language in which the document content is written. The values of this attribute are language identifiers as defined by RFC 3066 [RFC3066]. These identifiers can be used to identify country-wide variants of languages, based upon the use of ISO 3166 [ISO3166] country codes. Thus "en-us" denotes US English, while "en-gb" denotes UK English.
This offers a limited means to influence which accent is selected, through the choice of the corresponding ISO 3166 country code. Unfortunately, there is no standard for designating regions within countries, as would be needed for a portable way to request accents such as Scottish or Welsh.
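As a sketch of these two approaches, the fragment below first requests a vendor-specific voice by name and then falls back to a country-level language code; the voice name "morag" is invented for illustration and the element structure follows the voice examples shown earlier in this document.

<?xml version="1.0" encoding="UTF-8"?>
<speak xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en">
  <!-- Request a hypothetical vendor-specific voice by name. -->
  <voice name="morag">Welcome to Edinburgh.</voice>
  <!-- Country-level selection only: "en-gb" requests UK English,
       but cannot single out a Scottish or Welsh accent. -->
  <voice xml:lang="en-gb">Welcome to the United Kingdom.</voice>
</speak>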
This appendix is Normative.
SSML is an application of XML 1.0 [XML] and thus supports [UNICODE] which defines a standard universal character set.
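As a brief sketch of this, the UTF-8 document below mixes scripts freely within a single document; the text content is invented for illustration.

<?xml version="1.0" encoding="UTF-8"?>
<speak xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="ja">
  <sentence>こんにちは。</sentence>
  <voice xml:lang="en">Hello from Tokyo.</voice>
</speak>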
Additionally, SSML provides a mechanism for precise control of the input and output languages via the use of the xml:lang attribute. This facility provides: