Speech Synthesis Markup Language Version 1.1 Requirements

W3C Working Draft 11 June 2007

Daniel C. Burnett, Nuance
双志伟 (Zhi Wei Shuang), IBM
Scott McGlashan, HP
Andrew Wahbe, Genesys
夏海荣 (Hairong Xia), Panasonic
严峻 (Yan Jun), iFLYTEK
吴志勇 (Zhiyong Wu), Chinese University of Hong Kong


In 2005, 2006, and 2007 the W3C held workshops to understand the ways, if any, in which the design of SSML 1.0 limited its usefulness for authors of applications in Asian, Eastern European, and Middle Eastern languages. In 2006 an SSML subgroup of the W3C Voice Browser Working Group was formed to review this input and develop requirements for changes necessary to support those languages. This document contains those requirements.

Status of this Document

This section describes the status of this document at the time of its publication. Other documents may supersede this document. A list of current W3C publications and the latest revision of this technical report can be found in the W3C technical reports index at http://www.w3.org/TR/.

This is the 11 June 2007 W3C Working Draft of "Speech Synthesis Markup Language Version 1.1 Requirements".

This document describes the requirements for changes to the SSML 1.0 specification required to fulfill the charter given in [Section 1.2]. This is the second Working Draft. The group does not expect this document to become a W3C Recommendation. Changes since the previous version are listed in Appendix A.

This document has been produced as part of the W3C Voice Browser Activity, following the procedures set out for the W3C Process. The authors of this document are members of the Voice Browser Working Group. You are encouraged to subscribe to the public discussion list <www-voice@w3.org> and to mail us your comments. To subscribe, send an email to <www-voice-request@w3.org> with the word subscribe in the subject line (include the word unsubscribe if you want to unsubscribe). A public archive is available online.

This document was produced by a group operating under the 5 February 2004 W3C Patent Policy. W3C maintains a public list of any patent disclosures made in connection with the deliverables of the group; that page also includes instructions for disclosing a patent. An individual who has actual knowledge of a patent which the individual believes contains Essential Claim(s) must disclose the information in accordance with section 6 of the W3C Patent Policy.

Publication as a Working Draft does not imply endorsement by the W3C Membership. This is a draft document and may be updated, replaced or obsoleted by other documents at any time. It is inappropriate to cite this document as other than work in progress.

1. Introduction

This document establishes a prioritized list of requirements for speech synthesis markup which any proposed markup language should address. This document addresses both procedure and requirements for the specification development. In addition to general requirements, the requirements are addressed in separate sections on Speech Interface Framework Consistency, Token/Word Boundary, Phonetic Alphabet and Pronunciation Script, Language Category, and Name/Proper Noun Identification Requirements, followed by Future Study and Acknowledgements sections.

1.1 Background and motivation

As a W3C standard, one of the aims of SSML (see [SSML] for description) is to be suitable and convenient for use by application authors and vendors worldwide. A brief review of the most broadly-spoken world languages [LANGUAGES] shows a number of languages that are in large commercial or emerging markets for speech synthesis technologies but for which there was limited or no participation by either native speakers or experts during the development of SSML 1.0. To determine in what ways, if any, SSML is limited by its design with respect to supporting these languages, the W3C held three workshops on the Internationalization of SSML. The first workshop [WS], in Beijing, PRC, in November 2005, focused primarily on Chinese, Korean, and Japanese languages, and the second [WS2], in Crete, Greece, in May 2006, focused primarily on Arabic, Indian, and Eastern European languages. The third workshop [WS3], in Hyderabad, India, in January 2007, focused heavily on Indian and Middle Eastern languages.

These three workshops resulted in excellent suggestions for changes to SSML, describing the ways in which SSML 1.0 has been extended and enhanced around the world. An encouraging result from the workshops was that many of the problems might be solvable using similar, if not identical, solutions. In fact, it may be possible to increase dramatically the usefulness of SSML for many application authors around the world by making a limited number of carefully-planned changes to SSML 1.0. That is the goal of this effort.

1.2 SSML 1.1 subgroup charter

The scope for a W3C recommendation for SSML 1.1 is modifications to SSML 1.0 to

  1. Provide broadened language support
    1. For Mandarin, Cantonese, Hindi*, Arabic*, Russian*, Korean*, and Japanese, we will identify and address language phenomena that must be addressed to enable support for the language. Where possible we will address these phenomena in a way that is most broadly useful across many languages. We have chosen these languages because of their economic impact and expected group expertise and contribution.
    2. We will also consider phenomena of other languages for which there is both sufficient economic impact and group expertise and contribution.
  2. Fix incompatibilities with other Voice Browser Working Group languages, including Pronunciation Lexicon Specification [PLS], Speech Recognition Grammar Specification [SRGS], and VoiceXML 2.0/2.1 [VXML2, VXML21] (e.g., caching attributes and error processing).

VCR-like controls are out of scope for SSML 1.1. We may discuss <say-as> (see [SAYAS]) issues that are related to the SSML 1.1 work above and collect requirements for the next document that addresses <say-as> values. We will not create specifications for additional <say-as> values but may publish a separate Note containing the <say-as> requirements specifically related to the SSML 1.1 work. We will follow standard W3C procedures.

* provided there is sufficient group expertise and contribution for these languages

1.3 Requirements development process

The General Requirements in section 2 arose out of SSML-specific and general Voice Browser Working Group discussions. The Speech Interface Framework Consistency Requirements in section 3 were generated by the Voice Browser Working Group. The SSML subgroup developed the charter. The remaining requirements were then developed as follows:

First, the SSML subgroup grouped topics presented and discussed at the workshops (see Section 1.1) into the following categories:

Short-term
The group agrees to work on these topics.
Long-term
After the short-term work is complete the group will revisit these topics to determine whether or not they belong in the scope of SSML 1.1 and can be completed by the SSML subgroup.
Experts needed
We need experts in other relevant languages to actively participate in the subgroup before we can make the decision to work on these topics in this subgroup.
Other SSML work
These topics are out of scope for SSML 1.1. These items belong in SSML 2.0 or later, a separate <say-as> Note (see [SAYAS]), etc.

The following table shows how the topics were categorized. There is no implied ordering within each column.

| Short-term (group agrees to work on this) | Long-term (after short-term work will revisit to determine if belongs in group) | Experts needed (in order to make decision to work on this in this subgroup) | Other SSML work (SSML 2.0 or later, <say-as> Note, etc.) |
| --- | --- | --- | --- |
| Token/word boundaries | Tones | Providing number, case, gender agreement info | Special words |
| Phonetic alphabets | Expand Part-Of-Speech support | Syllable markup | Tone sandhi |
| Verify that RFC3066 language categories are complete enough that we do not need anything new beyond xml:lang to identify languages and dialects | Text with multiple languages (changing xml:lang without changing voice; separately specifying language of content and language to speak) | Diacritics, SMS text, simplified/alternate text | Enhance prosody rate to include "speech units per time unit" where speech units would be syllable, mora, phoneme, foot, etc. and time unit would be seconds, ms, minutes, etc. (would address mora/sec request) |
| Chinese names (say-as requirements) | | Sub-word unit demarcation and annotation | Background sound (may be handled best by VoiceXML3 work) |
| Ruby | | Transliteration | Expressive elements |
| | | | Sentence structure |

Next, for each topic in the Short-term list, we developed one or more problem statements. Where applicable, the problem statements have been included in this document.
We then generated requirements to address the problem statements.

It is interesting to note that the three Long-term topics have been addressed by the requirements developed while working on the Short-term topics: Tones are addressed via the pronunciation alphabets, Part-Of-Speech support may be at least partially addressed via requirement 4.2.3, and Text with multiple languages is being addressed as part of the language category requirements.

The topics in the remaining two categories (Experts needed and Other SSML work) are listed and briefly described in the Future Study section.

2. General Requirements

2.1 Backwards compatibility

SSML 1.1 should be backwards compatible with SSML 1.0 except where modification is necessary to satisfy other requirements in this document.

2.2 Use of IRIs instead of URIs

SSML 1.1 may use Internationalized Resource Identifiers [RFC3987] instead of URIs.

3. Speech Interface Framework Consistency Requirements

This section must include requirements that make SSML consistent with the other Speech Interface Framework specifications, including VoiceXML 2.0/2.1, PLS, SRGS, and SISR in both behavior and syntax, where possible.

3.1 Caching attributes

3.1.1 <audio> caching attributes

SSML must support the maxage and maxstale attributes for the <audio> element as supported in VoiceXML 2.1.
SSML lacks these attributes, so it is not clear how SSML enforces (or even has) a caching model for audio resources.

3.1.2 <lexicon> caching attributes

SSML must support the maxage and maxstale attributes for the <lexicon> element.

3.1.3 Caching defaults

SSML should provide a mechanism for an author to set default values for the maxage and maxstale attributes.
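
As an illustration of how the caching attributes in this section might be consumed, the following Python sketch decides whether a cached audio resource may be reused. It is a minimal sketch under stated assumptions: the function name, its parameters, and the sample markup are hypothetical, and the decision logic assumes that maxage and maxstale behave like the HTTP 1.1 request directives on which the VoiceXML attributes are modeled. SSML 1.1 might define the details differently.

```python
import xml.etree.ElementTree as ET

# Hypothetical SSML fragment: maxage and maxstale are given in seconds,
# mirroring the VoiceXML attributes of the same names.
SSML = """<speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
  <audio src="http://www.example.com/prompt.wav" maxage="60" maxstale="10">Welcome</audio>
</speak>"""

NS = "{http://www.w3.org/2001/10/synthesis}"


def cache_entry_usable(age, freshness_lifetime, maxage=None, maxstale=None):
    """Return True if a cached copy may be used without refetching.

    age: seconds since the copy was fetched or last validated.
    freshness_lifetime: seconds for which the origin declared the copy fresh.
    """
    if maxage is not None and age > maxage:
        return False  # older than the author is willing to accept
    if age <= freshness_lifetime:
        return True  # still fresh
    staleness = age - freshness_lifetime
    return maxstale is not None and staleness <= maxstale


# Read the (hypothetical) attributes from the <audio> element.
audio = ET.fromstring(SSML).find(NS + "audio")
maxage = int(audio.get("maxage"))
maxstale = int(audio.get("maxstale"))
```

A default mechanism as described in 3.1.3 would simply supply the maxage and maxstale arguments when the element itself carries none.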

3.2 Error messages in VoiceXML 3.0

SSML should provide error messages that include detail.

SSML 1.0 defines error [SSML §1.5] as "Error Results are undefined. A conforming synthesis processor may detect and report an error and may recover from it." Note that in the case of an <audio> element where there is a protocol error fetching the URI resource, or where the resource cannot be played, VoiceXML might log this information in its session variables. The error information likely to be required is the URI itself, the protocol response code, and a reason (textual description). It is expected that the SSML processor would recover from this error (play fallback content if specified, or ignore the element).
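
The error detail described above could be carried in a small record. The following Python sketch is illustrative only: the class name, field names, and the caller-supplied fetch function are assumptions, not part of any specification.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AudioFetchError:
    """Hypothetical record of a failed <audio> fetch or playback."""
    uri: str                      # the URI that was requested
    response_code: Optional[int]  # protocol response code, if any
    reason: str                   # textual description


def render_audio(uri, fetch, fallback_content, errors):
    """Try to fetch audio; on failure record the error and recover.

    fetch is a caller-supplied function returning (code, data).
    Returns the audio data, or the fallback content (possibly empty).
    """
    code, data = fetch(uri)
    if code == 200 and data is not None:
        return data
    errors.append(AudioFetchError(uri, code, "fetch failed"))
    return fallback_content  # play fallback content, or nothing if empty
```

The processor recovers in both cases; the record exists only so that a host environment such as VoiceXML can log what went wrong.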

3.3 "type" attribute

The <audio> element should be extended with a type attribute to indicate the media type of the URI. It may be used

  1. to indicate to the web server a preferred mime type, and
  2. to indicate the type of resource where such information isn't already covered by the protocol (e.g. file protocol).

The handling of the requested type versus an authoritative type returned by a protocol would follow the same approach described for the type in <lexicon> [SSML Section 3.1.4]. On a type mismatch, the processor should play the audio if it can.
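
The type handling described above can be sketched as follows. This is a minimal Python illustration with hypothetical function names; it assumes, following the approach SSML 1.0 describes for <lexicon>, that a type returned by the protocol is authoritative and that the author's type is used only when the protocol returns none.

```python
def effective_media_type(author_type, protocol_type):
    """Pick the media type to use for an <audio> resource.

    A type returned by the protocol is treated as authoritative; the
    type given by the author is used only when the protocol gives none.
    """
    if protocol_type:
        return protocol_type
    return author_type


def should_play(author_type, protocol_type, supported_types):
    """Play the audio whenever the processor can handle the effective type,
    even if it differs from the type the author requested."""
    return effective_media_type(author_type, protocol_type) in supported_types
```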

3.4 VCR controls in VoiceXML

SSML should be modified as necessary to operate effectively with VCR controls VoiceXML is looking to introduce.

3.4.1 SSML 1.1 should provide a mechanism to indicate that only a subset of the entire <speak> content is to be rendered. This mechanism should allow designation of the start and end of the subset based on time offsets from the beginning of the <speak> content, the end of the <speak> content, and marks within the content.

3.4.2 It would be nice if SSML 1.1 provided a mechanism to indicate that only a subset of the content of an <audio> element is to be rendered. This mechanism, if provided, should allow designation of the start and end of the subset based on time offsets from the beginning of the <audio> content, the end of the <audio> content, and marks within the content.
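
To make the three kinds of boundary designation in 3.4.1 and 3.4.2 concrete, the following Python sketch resolves a start and an end designation to absolute times. The tuple encoding, the function names, and the clamping behavior are all assumptions made for illustration; no syntax has been chosen for SSML 1.1.

```python
def resolve_offset(spec, duration, marks):
    """Resolve one boundary designation to seconds from the start.

    spec is a hypothetical (kind, value) pair:
      ("from_start", seconds), ("from_end", seconds), or ("mark", name).
    marks maps mark names to their time offsets in seconds.
    """
    kind, value = spec
    if kind == "from_start":
        t = value
    elif kind == "from_end":
        t = duration - value
    elif kind == "mark":
        t = marks[value]
    else:
        raise ValueError("unknown boundary kind: %r" % (kind,))
    return min(max(t, 0.0), duration)  # clamp into the content


def rendered_subset(start, end, duration, marks):
    """Return the (begin, finish) times of the subset to be rendered."""
    begin = resolve_offset(start, duration, marks)
    finish = resolve_offset(end, duration, marks)
    if finish < begin:
        raise ValueError("end precedes start")
    return begin, finish
```

For example, with 20 seconds of content and a mark at 4.5 seconds, starting at the mark and ending 2 seconds before the end selects the interval from 4.5 to 18.0 seconds.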

3.4.3 SSML 1.1 should provide a mechanism to adjust the speed of the rendered <speak> content.

3.4.4 It would be nice if SSML 1.1 provided a mechanism to either adjust or set the average pitch of the rendered <speak> content.

3.4.5 SSML 1.1 should provide a mechanism to either adjust or set the volume of the rendered <speak> content.

3.5 Lexicon synchronization

Authors must be given explicit control over which <lexicon>-specified lexicons are active for which portions of the document. This will allow explicit activation/deactivation of lexicons.

3.6 Prefetching support

It would be nice if SSML were modified to support prefetching of audio as defined by the "fetchhint" attribute of the <audio> tag in VoiceXML 2.0 [VXML2]. The exact mechanism used by the VoiceXML interpreter to instruct the SSML processor to prefetch audio may be out of scope. However, SSML should at a minimum recommend behavior for asserting audio resource freshness at the point of playback. This clarifies how audio resource prefetching and caching behaviors interact.

3.7 External reference to text structure

SSML 1.1 must provide a way to uniquely reference <p>, <s>, and the new word-level element (see Section 4) for cross-referencing by external documents.

4. Token/Word Boundary Requirements

This section must include requirements that address the following problem statement:

All TTS systems make use of word boundaries to do synthesis. All Chinese/Thai/Japanese systems today must do additional processing to identify word boundaries because white space is not normally used as a boundary identifier in written language. Errors in this processing can degrade output quality and even cause misunderstandings. Overall TTS performance for these systems can be improved if document authors can hand-label the word boundaries where errors are expected or found to occur.

4.1 Word boundary disambiguation

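SSML 1.1 must provide a mechanism to eliminate word segmentation ambiguities. This is necessary in order to render languages, such as Chinese, Thai, and Japanese, in which white space is not normally used as a word boundary identifier.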

Resulting benefits can include improved cues for prosodic control (e.g., pause) and may assist the synthesis processor in selection of the correct pronunciation for homographs.
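
A commonly cited Chinese example shows why author-supplied boundaries matter. In the Python sketch below the same seven characters are segmented in two ways that read very differently. The <w> element name is hypothetical (this document does not fix a name for the word-level element), and the helper functions are illustrative.

```python
import xml.etree.ElementTree as ET

# The same character string marked up with two different segmentations.
# <w> is a hypothetical word-level element.
READING_A = "<s><w>南京市</w><w>长江大桥</w></s>"        # Nanjing City / Yangtze River Bridge
READING_B = "<s><w>南京</w><w>市长</w><w>江大桥</w></s>"  # Nanjing / mayor / Jiang Daqiao


def words(fragment):
    """Return the author-marked words of a fragment, in document order."""
    return [w.text for w in ET.fromstring(fragment).iter("w")]


def raw_text(fragment):
    """Return the character content with all markup removed."""
    return "".join(ET.fromstring(fragment).itertext())
```

Without the markup the two fragments are indistinguishable, so a synthesis processor has to guess; with it, the author's intended segmentation is explicit.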

4.2 Annotation of words

4.2.1 SSML 1.1 must provide a mechanism for annotating words.

4.2.2 SSML 1.1 must standardize an annotation of the language using mechanisms similar to those used elsewhere in the specification to identify language.

4.2.3 SSML 1.1 must standardize a mechanism to refer to the correct pronunciation in the Pronunciation Lexicon Specification, in particular when there are multiple pronunciations for the same orthography. This will enhance the existing implied correspondence between words and pronunciation lexicons.

5. Phonetic Alphabet and Pronunciation Script Requirements

This section must include requirements that address the following problem statement:

Although IPA (and its textual equivalents) provides a way to write every pronunciation for every language, for some languages there are alternative pronunciation scripts (not necessarily phonetic/phonemic) that are already widely known and used; these scripts may still require some modifications to be useful within SSML. SSML requires support for IPA and permits any string to be used as the value of the "alphabet" attribute in the <phoneme> element. However, TTS vendors for these languages want a standard reference for their pronunciation scripts. This might require extra work to define a standard reference.

5.1 Registry for alternative pronunciation scripts

5.1.1 SSML 1.1 must enable the use of values for the "alphabet" attribute of the <phoneme> element that are defined in a registry that can be updated independent of SSML. This registry and its registration policy must be defined by the SSML subgroup.

The intent of this change is to encourage the standardization of alternative pronunciation scripts, for example Pinyin for Mandarin, Jyutping for Cantonese, and Ruby for Japanese.

As part of the discussion on the registration policy, the SSML subgroup should consider the following:

5.1.2 The registry named in 5.1.1 should be maintained through IANA.
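
The following Python sketch suggests how a processor might check an "alphabet" value against such a registry. The registry contents are invented for illustration ("ipa" is the only value SSML 1.0 itself requires), the function name is hypothetical, and the "x-" check reflects the vendor-defined form used in SSML 1.0.

```python
# Hypothetical registry contents. Only "ipa" comes from SSML 1.0; the
# other entries are invented to illustrate registered alternative scripts.
REGISTERED_ALPHABETS = {"ipa", "pinyin", "jyutping"}


def alphabet_status(value, registry=REGISTERED_ALPHABETS):
    """Classify the value of the alphabet attribute of <phoneme>."""
    if value in registry:
        return "registered"
    if value.startswith("x-") and len(value) > 2:
        return "vendor-specific"  # private alphabet, not in the registry
    return "unknown"
```

Because the registry is data rather than part of the language definition, new entries could be added without revising SSML itself, which is the point of 5.1.1.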

6. Language Category Requirements

This section must include requirements that address the following problem statement:

The xml:lang attribute in SSML is the only way to identify the language. It represents both the natural (human) language of the text content and the natural (human) language the synthesis processor is to produce. For languages whose scripts are ideographs rather than pronunciation-related, we are not sure that the permitted values for xml:lang, as specified by RFC3066, are detailed enough to distinguish among languages (and their dialects) that use the same ideographs.

6.1 Successor to RFC3066 support

SSML 1.1 must ensure the use of a version of xml:lang that uses the successor specification to RFC3066 [RFC3066] (for example, BCP47 [BCP47]).

This will provide sufficient flexibility to indicate all of the needed languages, scripts, dialects, and their variants.

6.2 xml:lang requirements

6.2.1 SSML 1.1 must clearly state that the 'xml:lang' attribute identifies the language of the content.

6.2.2 SSML 1.1 must clearly state that processors are expected to determine how to render the content based on the value of the 'xml:lang' attribute and must document expected rendering behavior for the xml:lang values they support.

6.2.3 SSML 1.1 must specify that selection of xml:lang and voice are independent. It is the responsibility of the TTS vendor to decide and document which languages are supported by which voices and in what way.
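
The extra detail available in the successor to RFC3066 can be seen by splitting a tag into its subtags. The Python sketch below is deliberately simplified: it handles only the language, script, and region subtags and rejects everything else, and the function name is hypothetical.

```python
def parse_language_tag(tag):
    """Split a simple language tag into language, script, and region.

    Handles only the language[-script][-region] shape; extended language
    subtags, variants, extensions, and private use are not supported.
    """
    parts = tag.split("-")
    result = {"language": parts[0].lower(), "script": None, "region": None}
    for part in parts[1:]:
        is_script = len(part) == 4 and part.isalpha()
        is_region = (len(part) == 2 and part.isalpha()) or (len(part) == 3 and part.isdigit())
        if is_script and result["script"] is None and result["region"] is None:
            result["script"] = part.title()
        elif is_region and result["region"] is None:
            result["region"] = part.upper()
        else:
            raise ValueError("unsupported subtag: %r" % part)
    return result
```

For instance, "zh-Hant-HK" and "zh-Hant-TW" name the same script but different regions, a distinction a processor could use when choosing how to render the content. Per 6.2.3, which voice speaks that content remains a separate choice.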

7. Name/Proper Noun Identification Requirements

This section must include requirements on a future version of <say-as> to support better interpretation of Chinese names and Korean proper nouns.

In some languages, it is necessary to do some special handling to identify names/proper nouns. For example, in some Asian languages, the pronunciation of characters used in Chinese surnames and Korean proper nouns will change. If the name/proper noun is properly marked, there is a predictable pronunciation for it. Such a requirement is crucial and must be satisfied because, in languages such as Chinese and Korean, there is no obvious marker that distinguishes names/proper nouns from other content (e.g., there is no capitalization as used in English) and it is often difficult for the speech synthesis processor to automatically identify all the names/proper nouns properly.

It is also important to identify which part of a name is the surname and which part(s) is/are the given name(s) since there might be several patterns of different surname/given name combinations. For example,

7.1 Identify content as proper noun

A future version of SSML must provide a mechanism to identify content as a proper noun.

7.2 Identify content as name

A future version of SSML must provide a mechanism to identify content as a name. This might be done by creating a new "name" value for the interpret-as attribute of the <say-as> element, along with appropriate values for the format and detail attributes.

7.3 Identify name sub-content as surname

A future version of SSML must provide a mechanism to identify a portion of a name as the surname.

8. Future Study

This section contains issues that were identified during requirements capture but which have not been directly incorporated into the current set of requirements. The descriptions are not intended to be exhaustive but rather to give a brief explanation of the core idea(s) of the topics.

8.1 Number, gender, case agreement

Japanese, Hungarian, and Arabic words all vary by number, gender, case, and/or category. An example difficulty occurs in reading numeric values from news feeds, since the actual spoken numbers may change based on implied context. By providing this context the synthesizer can generate the proper word.

8.2 Syllable markup

The two main use cases/motivations for this capability are

  1. boundary delineation/foundational unit: For languages that are syllable-based or for which syllable boundaries are important (e.g., for morphological analysis), this capability could be quite useful. It may be that other existing requirements for arbitrary pronunciation alphabets can mitigate this somewhat by allowing authors to use a boundary-marking alphabet targeted at their own language.
  2. desire for prosodic or other markup at this level: This is a special case of Section 8.4, below.

The current belief is that this markup is not needed in order to accomplish the stated objectives of SSML 1.1. Since markup of syllables and particularly the use of prosodic markup at a syllable level challenges the implicit word-level foundation of SSML 1.0, changes of this nature are likely to be far-reaching in consequence for the language. Unless this is later discovered to be necessary, this work should wait for a fuller rewrite of SSML than is anticipated for SSML 1.1.

8.3 Diacritics, SMS text, simplified/alternate text

There are a number of cases where SSML is used to render other-than-traditional forms of text. The most common of these appears to be mobile text messages. It is fairly common to see significantly abbreviated text (such as "cul8r" for "see you later" in English) and, for non-English languages, text that does not properly use native character sets. Examples include dropped diacritics in Polish (e.g., the word pączek written as the word paczek) or the use of the three-symbol string '}|{' to represent the Russian letter 'Ж'.

8.4 Sub-word unit demarcation and annotation

In Chinese, the foundational writing unit is the character, and although there may be many different pronunciations for a given character, each pronunciation is only a single syllable. It is thus common in Chinese synthesis processors to be able to control prosodic information such as contrastive stress at the syllable level.

Hungarian is a highly agglutinative language whose significant morphological variations are represented in the orthography. Thus, contrastive stress may need to be marked at a sub-word level. For example, “Nem a dobozon, hanem a dobozban van a könyv” means “The book is not on the box, but in the box.”

Note that the approaches currently being considered to address the requirements in Section 4 may provide a limited ability to do sub-word prosodic annotation.

8.5 Transliteration

Many of the languages on the Indian subcontinent are based on a common set of underlying phonemic units and have writing systems (scripts) that are based on these underlying units. The scripts for these languages may differ substantially from one another, however, and from the historic Indian script specifically designed for writing pronunciations. Additionally, because of the spread of communication systems in which it is easier to write in Latin scripts (or ASCII, in particular) than in native scripts, India has seen a proliferation of ASCII-based writing systems that are also based on the same underlying phonemic units. Unfortunately, these ASCII-based writing systems are not standardized.

The challenge for speech synthesis systems today is that the system will often use several lexicons, each of which uses a different pronunciation writing system. Pronunciations given inline by an author may also be in a different (and potentially non-standard) writing system. This challenge is currently addressed for Indian speech synthesis systems by using transliteration among code pages. Each code page describes how a particular writing system maps into a canonical writing system. It is thus possible for a synthesis processor to know how to convert any text into a representation of pronunciation that can be looked up in a lexicon.

Although the need to use different pronunciation alphabets will be addressed for standard alphabets, i.e., those for the different Indian languages, to address the user-specific ASCII representations a more generic mapping facility might be needed. Such a capability might also address the common issue of how to map mobile phone short message text into the standard grapheme representations used in a lexicon.
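
The code page approach described above amounts to a table-driven conversion into a canonical symbol set. The Python sketch below shows the idea with two invented ASCII schemes; the scheme names, table entries, canonical symbols, and function name are all hypothetical and do not represent any real transliteration standard.

```python
# Two hypothetical "code pages": each maps symbols of one writing system
# to a shared canonical symbol set. All entries are invented.
CODE_PAGES = {
    "ascii-scheme-a": {"aa": "A:", "a": "A", "kh": "KH", "k": "K"},
    "ascii-scheme-b": {"A": "A:", "a": "A", "K": "KH", "k": "K"},
}


def to_canonical(text, code_page):
    """Greedy longest-match conversion of text into canonical symbols."""
    table = CODE_PAGES[code_page]
    longest = max(len(key) for key in table)
    out, i = [], 0
    while i < len(text):
        # Try the longest candidate first so "kh" wins over "k".
        for size in range(min(longest, len(text) - i), 0, -1):
            chunk = text[i:i + size]
            if chunk in table:
                out.append(table[chunk])
                i += size
                break
        else:
            raise ValueError("no mapping for %r" % text[i])
    return out
```

Once two differently written inputs reduce to the same canonical sequence, a single lexicon lookup serves both.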

8.6 Special words

Many new values for the "interpret-as" attribute of the <say-as> element have been suggested. Common ones include URI, email address, postal address, and email. Although clearly useful, these values are similar, if not identical, to ones considered during the development of the Say-as Note [SAYAS]. It is not clear which, if any, of the values suggested are critically, or at least more, necessary for languages other than those for which SSML 1.0 works well today. These suggestions from the workshops may be incorporated into future work on the <say-as> element, which is outside the scope of the SSML 1.1 effort.

8.7 Tone Sandhi

When the nominal tones of sequences of syllables in Chinese match certain patterns, the actual spoken tones change in predictable ways. For example, in Mandarin if two tone 3 syllables occur together, the first will actually be pronounced as tone 2 instead of tone 3. Similar, but different, rules apply for Cantonese and for the many other spoken languages that use the written Han characters. This need may be addressed sufficiently by other requirements in this document.
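
The Mandarin rule mentioned above is simple enough to state in code. The Python sketch below applies it to syllables written in numbered Pinyin (tone digit at the end of each syllable). The function name and input convention are assumptions, and the sketch covers only the two-syllable case described here; longer runs of tone 3 syllables depend on phrasing and are handled only approximately.

```python
def apply_third_tone_sandhi(syllables):
    """Apply the pairwise Mandarin tone 3 rule to numbered-Pinyin syllables.

    A tone 3 syllable immediately followed by another tone 3 syllable is
    changed to tone 2. Decisions are based on the nominal (input) tones.
    """
    result = list(syllables)
    for i in range(len(syllables) - 1):
        if syllables[i].endswith("3") and syllables[i + 1].endswith("3"):
            result[i] = syllables[i][:-1] + "2"
    return result
```

For example, the nominal tones of "ni3 hao3" become "ni2 hao3" when spoken.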

8.8 More flexible prosody rate

The rate attribute of the <prosody> element in SSML 1.0 only allows for relative changes to the speech rate, not absolute settings. A primary reason for this was lack of agreement on what units would be used to set the rate -- phonemes, syllables, words, etc. With the feedback received so far, it would be possible to enhance the prosody rate to permit absolute values of the form "X speech units per time unit" where speech units could be selected by the author to be syllable, mora, phoneme, foot, etc. and time units could be selected by the author to be seconds, ms, minutes, etc. This is a good example of a feature that should be considered if and when an SSML 2.0 is developed.
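
An absolute rate of the form described above reduces to a count of speech units per second. The Python sketch below parses one possible textual form, "COUNT UNIT/TIMEUNIT"; this syntax, the unit names, and the function name are all hypothetical, since no syntax has been proposed.

```python
# Hypothetical unit vocabularies for an absolute rate value.
SECONDS_PER_TIME_UNIT = {"ms": 0.001, "s": 1.0, "min": 60.0}
SPEECH_UNITS = {"phoneme", "mora", "syllable", "foot", "word"}


def parse_absolute_rate(value):
    """Parse a hypothetical 'COUNT UNIT/TIMEUNIT' rate, e.g. '300 mora/min'.

    Returns (speech_unit, units_per_second).
    """
    count_text, _, unit_text = value.strip().partition(" ")
    speech_unit, _, time_unit = unit_text.partition("/")
    if speech_unit not in SPEECH_UNITS or time_unit not in SECONDS_PER_TIME_UNIT:
        raise ValueError("unsupported rate: %r" % value)
    return speech_unit, float(count_text) / SECONDS_PER_TIME_UNIT[time_unit]
```

Normalizing to units per second lets a processor compare requests made in different time units, although rates in different speech units (mora versus syllable, say) remain incomparable without language-specific knowledge.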

8.9 Background sound

There are many requests to permit a separate audio track to be established to provide background speech, music, or other audio. This feature is about audio mixing rather than speech synthesis, so either it should be handled outside of SSML (via SMIL [SMIL2] or via a future version of VoiceXML) or a more thorough analysis of what audio mixing capabilities are desired should be done as part of a future version of SSML.

8.10 Expressive elements

There are requests for speaking style ("news", "sports", etc.) and emotion portrayal ("angry", "joyful", "sad") that represent high-level requests that result in rather sophisticated speech production changes, and historically there has been insufficient agreement on how these styles would be rendered. However, this is slowly changing -- see, for example, the W3C Emotion Incubator Group [EMOTION]. This category of request most definitely should be considered when developing a future version of SSML.

8.11 Sentence structure

SSML 1.0 has only two explicit logical structure elements: <p> (paragraph) and <s> (sentence). In addition, whitespace is used as an implicit word boundary. There have been requests to provide other sub-sentence structure such as phrase markers (and explicit word marking, one of the requirements earlier in this document). The motivations for such features vary slightly but usually center around providing improved prosodic control. This is a good topic to reconsider in a future, possibly completely rewritten, version of SSML.

9. References

[BCP47] IETF BCP47, currently represented by Tags for the Identification of Languages, A. Phillips, M. Davis, Editors. IETF, September 2006. This RFC is available at http://www.ietf.org/rfc/rfc4646.txt.
[EMOTION] W3C Emotion Incubator Group, World Wide Web Consortium. The group's website is available at http://www.w3.org/2005/Incubator/emotion/.
[IPA] Handbook of the International Phonetic Association, International Phonetic Association, Editors. Cambridge University Press, July 1999. Information on the Handbook is available at http://www2.arts.gla.ac.uk/ipa/handbook.html.
[LANGUAGES] The 30 Most Spoken Languages of the World, KryssTal, 2006. The website is available at http://www.krysstal.com/spoken.html.
[PLS] Pronunciation Lexicon Specification (PLS) Version 1.0, Paolo Baggia, Editor. World Wide Web Consortium, 26 October 2006. This version of the PLS Working Draft is http://www.w3.org/TR/2006/WD-pronunciation-lexicon-20061026/ and is a Work in Progress. The latest version is available at http://www.w3.org/TR/pronunciation-lexicon/.
[RFC3066] Tags for the Identification of Languages, H. Alvestrand, Editor. IETF, January 2001. This RFC is available at http://www.ietf.org/rfc/rfc3066.txt.
[RFC3987] Internationalized Resource Identifiers (IRIs), M. Duerst and M. Suignard, Editors. IETF, January 2005. This RFC is available at http://www.ietf.org/rfc/rfc3987.txt.
[SRGS] Speech Recognition Grammar Specification Version 1.0, Andrew Hunt and Scott McGlashan, Editors. World Wide Web Consortium, 16 March 2004. This version of the SRGS 1.0 Recommendation is http://www.w3.org/TR/2004/REC-speech-grammar-20040316/. The latest version is available at http://www.w3.org/TR/speech-grammar/.
[SAYAS] SSML 1.0 say-as attribute values, Daniel C. Burnett and Paolo Baggia, Editors. World Wide Web Consortium, 26 May 2005. This version of the Say-as Note is http://www.w3.org/TR/2005/NOTE-ssml-sayas-20050526/. The latest version is available at http://www.w3.org/TR/ssml-sayas/.
[SMIL2] Synchronized Multimedia Integration Language, Dick Bulterman, et al., Editors. World Wide Web Consortium, 13 December 2005. This version of the SMIL 2 Recommendation is http://www.w3.org/TR/2005/REC-SMIL2-20051213/. The latest version is available at http://www.w3.org/TR/SMIL2/.
[SSML] Speech Synthesis Markup Language (SSML) Version 1.0, Daniel C. Burnett, et al., Editors. World Wide Web Consortium, 7 September 2004. This version of the SSML 1.0 Recommendation is http://www.w3.org/TR/2004/REC-speech-synthesis-20040907/. The latest version is available at http://www.w3.org/TR/speech-synthesis/.
[VXML2] Voice Extensible Markup Language (VoiceXML) Version 2.0, Scott McGlashan, et al., Editors. World Wide Web Consortium, 16 March 2004. This version of the VoiceXML 2.0 Recommendation is http://www.w3.org/TR/2004/REC-voicexml20-20040316/. The latest version is available at http://www.w3.org/TR/voicexml20/.
[VXML21] Voice Extensible Markup Language (VoiceXML) 2.1, Matt Oshry, et al., Editors. World Wide Web Consortium, 25 April 2007. This version of the VoiceXML 2.1 Proposed Recommendation is http://www.w3.org/TR/2007/PR-voicexml21-20070425/. The latest version is available at http://www.w3.org/TR/voicexml21/.
[WS] Minutes, W3C Workshop on Internationalizing the Speech Synthesis Markup Language, 2-3 November 2005. The agenda and minutes are available at http://www.w3.org/2005/08/SSML/ssml-workshop-agenda.html.
[WS2] Minutes, W3C Workshop on Internationalizing the Speech Synthesis Markup Language, 30-31 May 2006. The agenda is available at http://www.w3.org/2006/02/SSML/agenda.html. The minutes are available at http://www.w3.org/2006/02/SSML/minutes.html.
[WS3] Minutes, W3C Workshop on Internationalizing the Speech Synthesis Markup Language, 13-14 January 2007. The agenda is available at http://www.w3.org/2006/10/SSML/agenda.html. The minutes are available at http://www.w3.org/2006/10/SSML/minutes.html.

10. Acknowledgements

The editors wish to thank the members of the Voice Browser Working Group involved in this activity (listed in family name alphabetical order):

芦村和幸 (Kazuyuki Ashimura), W3C
Paolo Baggia, Loquendo
Paul Bagshaw, France Telecom
Jerry Carter, Nuance
馮恬瑩 (Tiffany Fung), Chinese University of Hong Kong
黄力行 (Lixing Huang), Chinese Academy of Sciences
Jim Larson, Intel
楼晓雁 (Lou Xiaoyan), Toshiba
蒙美玲 (Helen Meng), Chinese University of Hong Kong
陶建华 (JianHua Tao), Chinese Academy of Sciences
王霞 (Wang Xia), Nokia

Appendix A. Changes since the previous version