1． W3C organization by Max
a) How does W3C work?
i. A member organization
ii. About 50 working groups.
iii. A well-defined and efficient work process.
b) The voice browser working group
i. 88 participants from 33 organizations.
i. Working Draft -> Last Call Working Draft (LCWD) -> Candidate Recommendation -> Proposed Recommendation (PR) -> Recommendation
d) Intellectual Property
i. A clear patent policy
ii. Open and Royalty Free Standards
iii. Unique in the standards world.
e) Join W3C and help us build.
2． Understanding say-as: chief editor, SSML
a) Background on this confusing feature of the language.
b) Guiding principles of SSML
i. Convenient annotation of existing text for audio rendering.
ii. Control at all levels, from text structure and normalization to prosodic control and even voice characteristics.
iii. Limited critical error conditions: “rendering must go on”.
c) Guiding principles of say-as
i. The primary purpose of say-as is to allow correct interpretation of text commonly written in human-readable documents.
ii. “Intended for when the processor has insufficient context to interpret ambiguous text”.
iii. Interpretation, not rendering.
d) Interpretation, not rendering.
i. Pronounce the contained text.
e) Types limited in behavior.
i. Say-as Note type (“interpret-as” value) inclusion criteria.
ii. The orthography is difficult to write by hand.
f) ‘Date’, ‘Time’,
g) Why did you remove type blah? Or why is there no type foo?
i. Either the use case for it was about rendering rather than interpretation, or there was doubt about its importance.
h) Summary, Conclusions, & 2 Questions.
i. What is the best way to accomplish semantic category-based rendering control?
3． Pronunciation Lexicon
a) Standard way: IPA
b) Other alphabets (each should be a standard):
i. SAMPA (not yet standardized)
ii. Pinyin, JEITA, etc.
c) The current PLS is monolingual.
d) The PLS language: <lexeme>
i. The <lexeme> element is the container of a lexicon entry. It is composed of.
e) The PLS language: <grapheme>
i. Different written forms with the same pronunciation.
f) The PLS language: <phoneme>
i. Can change alphabet.
g) The PLS language: <alias>
i. Especially useful for ASR.
h) Use cases/Future Issues
i. Multiple pronunciations for ASR
i) But it cannot deal with:
ii. Part-of-speech annotations (and other contextual information).
j) Quick demo of SSML + PLS
k) Standard lexicons?
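As a rough illustration of the PLS elements listed above, the following Python sketch builds a small lexicon in which one <lexeme> carries two <grapheme> spellings with a single IPA <phoneme>, and another uses an <alias>. The helper names (`q`, `build_lexicon`), the example entries and the IPA transcription are assumptions made for illustration; the namespace and element names are taken from the published PLS 1.0 Recommendation, which may differ from the draft discussed at the workshop.

```python
import xml.etree.ElementTree as ET

PLS_NS = "http://www.w3.org/2005/01/pronunciation-lexicon"
XML_NS = "http://www.w3.org/XML/1998/namespace"
ET.register_namespace("", PLS_NS)  # serialize PLS as the default namespace


def q(tag):
    """Qualify a tag name with the PLS namespace (hypothetical helper)."""
    return f"{{{PLS_NS}}}{tag}"


def build_lexicon(lang, entries, alphabet="ipa"):
    """Build a monolingual <lexicon> from a list of entry dicts."""
    root = ET.Element(
        q("lexicon"),
        {"version": "1.0", "alphabet": alphabet, f"{{{XML_NS}}}lang": lang},
    )
    for entry in entries:
        lexeme = ET.SubElement(root, q("lexeme"))
        for grapheme in entry["graphemes"]:
            ET.SubElement(lexeme, q("grapheme")).text = grapheme
        for phoneme in entry.get("phonemes", []):
            ET.SubElement(lexeme, q("phoneme")).text = phoneme
        for alias in entry.get("aliases", []):
            ET.SubElement(lexeme, q("alias")).text = alias
    return root


# Illustrative entries only; not taken from the workshop material.
entries = [
    {"graphemes": ["colour", "color"], "phonemes": ["ˈkʌlə"]},
    {"graphemes": ["W3C"], "aliases": ["World Wide Web Consortium"]},
]
lexicon = build_lexicon("en-GB", entries)
pls_xml = ET.tostring(lexicon, encoding="unicode")
print(pls_xml)
```

Because the lexicon carries a single xml:lang value, the sketch also reflects the point that the current PLS is monolingual.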
4． Why internationalize SSML?
a) Global users of the Web.
i. The Web is not only for native English speakers but for everyone in the world.
b) Extension of SSML ability.
i. Enhancement for non-
c) Problem to be solved: pronunciation ambiguity.
d) Prosodic Controls
i. Text Analysis: <p>,<s>, <say-as>, <lexicon>, <phoneme>
ii. Prosody Analysis: duration and speech rate.
iii. Fundamental frequency transition. <prosody> <emphasis> <break>
e) Goals & Scope of the workshop
i. To identify and prioritize extensions and additions to SSML.
5． Session 2:
a) Polish Telecom
i. The nature of the problem
1. Diacritic: sometimes called an accent mark; a mark added to a letter to alter a word’s pronunciation or to distinguish between similar words.
2. Example: Polish letters with diacritics.
a) 35 letters = 26 basic + 9 with diacritics
3. Different pronunciation with diacritics.
4. Why do Polish diacritics sometimes disappear?
a) There is no way to enter them while typing.
b) Five key presses are needed to enter one diacritic.
5. Quasi-Polish text (without diacritics)
a) Sometimes it is the only way to represent text.
ii. Similarities among other languages
1. Other languages:
a) Czech, Slovak
d) French, etc.
iii. Possible solutions
1. How to solve the problem?
a) A new dialect?
b) An alternative spelling (context dependent orthography)?
c) An erroneous text that requires correction (jargon)?
2. Should the TTS engine solve it, or external lexicons?
1. Instant messaging: invented words & phrases.
2. Reduced or different character set:
i. Simplified Chinese: zh-Hans
ii. Traditional Chinese: zh-Hant
iii. Chinese Romanization: zh-Latn
3. To solve the problem
a) We can use “slan’ce’” to describe diacritics.
b) Jargon or broken
i. Jargon: no loss of information.
ii. Broken text: information is lost.
4. Can we have the possibility of freely choosing components from different vendors?
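To make the "quasi-Polish" problem from this session concrete, here is a minimal Python sketch of how diacritics get lost and why that can destroy information. The function name `strip_polish_diacritics` and the example words are assumptions for illustration; the sketch only shows the forward (lossy) direction, not any restoration method proposed at the workshop.

```python
import unicodedata

# ł/Ł have no canonical Unicode decomposition, so they are mapped explicitly.
_NO_DECOMPOSITION = str.maketrans({"ł": "l", "Ł": "L"})


def strip_polish_diacritics(text):
    """Return the 'quasi-Polish' form of text, i.e. with diacritics removed."""
    decomposed = unicodedata.normalize("NFD", text.translate(_NO_DECOMPOSITION))
    # Drop the combining marks (acute, ogonek, dot above) left by NFD.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# The nine Polish letters with diacritics collapse onto basic letters.
print(strip_polish_diacritics("ąćęłńóśźż"))  # acelnoszz

# Two different words become identical once diacritics are lost,
# so a TTS engine can no longer pick the pronunciation from spelling alone.
print(strip_polish_diacritics("łaska") == strip_polish_diacritics("laska"))  # True
```

Going the other way (restoring diacritics) needs a lexicon or context, which is where the question of TTS engine versus external lexicon arises.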
6． Session 3:
a) An Introduction to S3ML
1. SSML & SinoVoice
2. Pinyin in Phoneme attribute.
3. <say-as> Definition
a) name, address, math, net
4. Domain Support
a) <voice domain = “”> element.
b) <domain name = “”> element
c) Some jargon in the domain.
i. Characteristics of Chinese
1. Rich in dialects:
2. No explicit phrase and word boundaries.
3. Monosyllabic and tonal.
ii. Proposed attributes for existing elements
iii. Proposed elements
1. <phrase> and <word>
a) If the <word> boundaries are known, there will be no homograph ambiguity.
iv. Proposed attribute values
i. Which do you believe to be particularly necessary for Chinese?
1. Chinese names: difficult to distinguish from other characters.
2. URL is distinguishable from Chinese Characters.
7． Session 4
a) iFLYTEK
Pinyin is widely used in
ii. For words composed of English letters, we need to distinguish Pinyin words from English words.
2. Pinyin words: Anhui, Hefei, Jiang Zemin
iii. Segmentation of Chinese Word
1. Word and Phrase element
2. What is the definition of word and phrase?
iv. Using background music.
1. <environment repeat = “yes” src = “1.wav”> Text…</environment>
1. iFLYTEK set up the enterprise standard CSSML in 2002.
2. Since 2003, the CSSML has been supported by iFLYTEK products.
3. CSSML was voted as a candidate of national standard.
4. Does any other company support CSSML? No.
5. Are there any intellectual property patents on CSSML? No.
i. Chinese Romanization for Chinese Voice Browsing.
i. Tone: (Special Attribute in phoneme element)
iii. Word boundary:
a) 上海是个 大都会：Shanghai is a metropolis
b) 上海人 大都 会：
i. <w detail = ‘3’>上海人大都会</w>
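The ambiguity in the example above can be reproduced with a small Python sketch that enumerates every segmentation of the unmarked string permitted by a word list. The word list `TOY_LEXICON` and the function `segmentations` are hypothetical and chosen only for this example; a real synthesizer would use a full lexicon and a disambiguation model.

```python
def segmentations(text, lexicon):
    """Return every way to split text into words found in lexicon."""
    if not text:
        return [[]]
    results = []
    for end in range(1, len(text) + 1):
        head = text[:end]
        if head in lexicon:
            for tail in segmentations(text[end:], lexicon):
                results.append([head] + tail)
    return results


# Toy word list, assumed for illustration only.
TOY_LEXICON = {"上海", "上海人", "人", "大都", "会", "大都会"}

for words in segmentations("上海人大都会", TOY_LEXICON):
    print(" | ".join(words))
```

With this toy word list the string has four segmentations, including both 上海人 | 大都 | 会 and 上海人 | 大都会, which is why explicit word-boundary markup removes the ambiguity.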
i. Character Pronunciations*
2. Part of speech: POS
ii. Word/Phrase Boundaries*
1. L0: syllable boundary
2. L1: prosodic word boundary
3. L2: minor phrase boundary
4. L3: major phrase boundary
<p xml:lang = “zh-cn” ssml:lang2 = “cn-sc”> </p>
iv. Sound Effect
1. <prosody post-filter = “some-filter”></prosody>
v. Speaking Style
1. <p ssml:prosody-template = “#1”>***</p>
1. <macro name = “date”>2005/10/20</macro>
vii. Say-as Extension:
a) If the foreign language can be synthesized, there is no need for translation.
a) Should allow multiple choices.
i. Sentence Structure & Word Boundary
3. Paragraph->Sentence->W (word)
4. Japanese: morphemes; POS is useful for pronunciation.
5. Korean: spaces separate words.
a) Morpheme Dictionary
b) Word boundary
2. Part of Speech
3. Phrase marking in general for all languages: not
8． Session 5
i. JEITA Speech Group:
1. Expert Committee on Speech Input/Output
2. First version: JEITA-62-2000, Revised version: JEITA-IT-4002
ii. Japanese Pronunciation in phoneme element:
1. “x-JEITA-IT-4002-kana”, etc.
iii. How to specify speaking rate in Japanese
1. The basic unit in Japanese is the mora; “mora” is called “拍” (haku) in Japanese.
a) ko N ni chi wa -> 5 moras
b) sya si n -> 3 moras
c) sya sin -> 2 syllables (the same word, still 3 moras)
2. So the mora is a suitable unit for Japanese speaking rate.
a) Japanese can specify: 4.5 moras per second.
b) Moras can also be used to indicate break length.
c) Chinese can specify: ** syllables per minute.
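The arithmetic behind a mora-based speaking rate can be sketched in a few lines of Python. The sketch assumes the input is already segmented into moras separated by spaces, as in the examples above; the function names are hypothetical, and automatic mora segmentation is deliberately left out.

```python
def count_moras(segmented):
    """Count moras in a string whose moras are separated by spaces."""
    return len(segmented.split())


def duration_seconds(segmented, moras_per_second=4.5):
    """Nominal speaking time at a given rate in moras per second."""
    return count_moras(segmented) / moras_per_second


def per_minute_to_per_second(units_per_minute):
    """Convert a per-minute rate (e.g. syllables per minute) to per second."""
    return units_per_minute / 60.0


print(count_moras("ko N ni chi wa"))   # 5
print(count_moras("sya si n"))         # 3
print(round(duration_seconds("ko N ni chi wa"), 2))
```

At 4.5 moras per second, the five-mora greeting takes a little over one second, and a break length given in moras converts to time the same way.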
iv. Ruby element
1. Pronunciation Annotation: 今日 kyowa (may be wrong)
a) There is a “Ruby Annotation: W3C Recommendation 31 May 2001”.
b) But the proposed one is simpler and sufficient.
2. Can this be covered by phoneme?
a) Is there any standard for Japanese Pronunciation Annotation?
v. Expansion of the say-as element
Interpret as “wago”: some kind of date format in
Several kinds of date formats in
Hanguil: Chinese characters in
2. Koreans basically read them the Korean way, but sometimes the Chinese way or the Japanese way.
ii. Chinese Characters in Korean
1. Chinese characters can be used.
2. 2000 Chinese characters are frequently used.
1. “ko”, “ko-CN”, “ja-KR”, “cn-KR” (pronounce the Chinese way):
2. Using different lexicon.
iv. Homograph Words in Korean.
1. Same word, different pronunciation, different meaning
a) The only difference between pronunciations is duration.
2. Suggest “tone” tag for this problem.
a) long, short duration
i. Dialects (Production:RFC3066 (new))
1. xml:lang = “zh-cmn-CN”
2. xml:lang = “zh-cmn-TW”
3. xml:lang = “zh-sc”: Sichuan Dialect
* It may become: xml:lang = “zh-Hans” speak-as = zh-CMN-CN
ii. What character set: How to decide which character set?
1. Candidate Alphabet
c) JEITA: IPA, kana,
d) LSHK Cantonese Romanization
e) SAMPA (x-SAMPA)
f) Korea KT standard
2. We may set up a registration process.
a) Country & Associations need to standardize it.
9． Session 6: Tone
a) Chinese Tone System
b) Cantonese Tone System
i. The difference between some tones is duration!
c) Syllable in Roman alphabet, tone as one-digit Arabic number.
d) Popular schemes are:
e) Proposed <tone> Element
i. Tone: vary with meaning, context and speaking style
ii. A tone element.
f) Examples of Using “tone” Element
i. Tone changes with meaning.
ii. Tone changes with context
iii. Tone changes with speaking style
g) Tone sandhi (Rules)
i. For Mandarin: Tone3 Tone3 ->Tone2 Tone3
ii. For Cantonese:
1. Tone4 Tone4 -> Tone4 Tone2 OR
2. Tone4 Tone4 -> Tone4 Tone1
3. Tone4 Tone4 (no change)
iii. Which tone shall we mark?
1. Original Tone.
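If the original tone is what gets marked, the surface tone has to be derived by the processor. The following Python sketch applies only the Mandarin rule listed above (Tone3 Tone3 -> Tone2 Tone3) to syllables written in Roman letters with a one-digit tone number. The function name is hypothetical, and the pairwise rule is a simplification: runs of three or more third tones depend on word and phrase structure, which this sketch ignores.

```python
def apply_third_tone_sandhi(syllables):
    """Return surface forms: tone 3 before an original tone 3 becomes tone 2."""
    surface = []
    for i, syllable in enumerate(syllables):
        next_is_tone3 = i + 1 < len(syllables) and syllables[i + 1].endswith("3")
        if syllable.endswith("3") and next_is_tone3:
            surface.append(syllable[:-1] + "2")
        else:
            surface.append(syllable)
    return surface


original = ["ni3", "hao3"]               # marked with original tones
print(apply_third_tone_sandhi(original))  # ['ni2', 'hao3']
```

Keeping the original tones in the markup leaves the input unchanged and lets each engine apply its own sandhi rules, including the optional Cantonese variants.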
h) Chinese Languages
i. Chinese Dialects:
1. In sum, 100 tones across the different languages
1. Is there a similar phenomenon in Western countries? Different articulation.
2. The TTS engine can do it. We may not include them in SSML.
3. We can use a complex one (IPA) or a simpler one (Pinyin).
4. For example: if we add a tone element with between 9 and 24 values, is this sufficient for specifying most Chinese dialects?
i. We agree to recommend:
1. It is necessary for SSML to support and encourage the definition of dialect-specific phonemic inventories. These may be syllabic, but must incorporate all phonemically relevant distinctions, such as: lexical tones, syllable duration, etc.
ii. Still under discussion:
1. Need for explicit tone marking element. (Beyond the use of phonemic inventory)
2. Need for explicit marking of tone sandhi. (Beyond the use of IPA)
a) Slight leaning towards IPA being sufficient because this is allophonic variation.
3. Need for explicit marking of phonemically-relevant syllable duration.
4. SSML needs to encourage the inclusion of more phonemic inventories.
5. Chinese Dialect <= tonal syllable inventory. If a syllable inventory has been published and widely accepted, then it is OK.
iii. Still under discussion:
1. Mark only the tone change, or mark the whole syllable?
10． Session 7: Sentence Structure
a) Speech Synthesis Markup Language – Aim at Extension
1. Original SSML->STML->JSML->SABLE->SSML
2. What we want from markup language
c) Extended to multimedia
3. Which level should we focus on:
a) Text analysis module
b) Prosody module
c) Acoustic module
4. Text level for Mandarin
a) Word boundary
b) Pronunciation with tone
5. Prosody Level for Mandarin
6. Extensions to expressive synthesis
a) Emotion and Style
ii. Current elements related to prosody and style are not enough.
1. Current Voice Element
a) Element: voice “gender”, “age”, “name”, “variant”, “sample”
1. To make it more expressive
a) Background music
b) VTTS: talking head
c) Currently, the only relevant element we can see is “mark”.
i. Focused discussion: How to represent the structure of text using SSML?
1. Paragraph->Sentence -> Phrase->Token
a) Allows for further annotation.
3. Paragraph->Sentence->Word or Phrase (3,2,3)
a) Only one element not two elements
b) Why can we not use spaces to separate words?
4. Are any lexical elements required?
i. We agree that there needs to be a way to explicitly mark lexical tokens (which may consist of multiple script characters).
ii. There is a lot of interest in adding phrase markings to SSML, but no agreement on how (or even on whether agreement can be reached).
iii. We recommend that this topic be considered in any future SSML work.
iv. Reason: some languages do not always explicitly mark word/token boundaries in their written scripts.
11． Session 8: Words with multiple pronunciations and meanings
i. Multiple pronunciation problem
1. Same word but different pronunciations
2. Same spelling but different pronunciations (homograph)
ii. POS for resolving multiple pronunciations.
1. POS information for LVCSR
1. POS attribute of phoneme element
1. No element or attribute for resolving multiple pronunciations
a) In current SSML, PLS
2. POS information
a) Can reduce the overhead of resolving multiple pronunciations in ASR and TTS systems.
b) Can reduce the search time in a large vocabulary recognition system.
c) Can be effective in agglutinative languages.
a) POS element, POS attribute
1. How many categories of POS?
a) Is there any standard for POS?
i. verb, noun, …, 5-6
b) Opinion from companies:
i. iFLYTEK: uses POS inside the system; no requirement for POS in the markup language.
iii. IBM: categories and sub-categories, 40-50 categories in total. Some of the categories focus on correct pronunciation. Different companies may have different considerations.
v. JEITA: Internally
We note that synthesis processors can make use of POS information. However, there is disagreement on whether users of SSML and PLS should be able to provide this information to the processor (as opposed to the processor determining the information for itself).
There may not be enough industry consensus to standardize what POS information to represent and how to represent it.
12． Session 9: Text with multiple languages
a) Nokia: Klatt (Formant) & Concatenative TTS
ECESS XML: in
b) Peculiarities in Asian Languages
i. Asian Language
3. No word marker/break;
ii. Multi-lingual phenomenon
iii. WORD element <word>
1. Important in languages that don’t have word boundaries (e.g. Thai, Chinese, Vietnamese);
2. Crucial for tone sandhi since many tone changes happen within a word;
i. One suggestion, for “mp3”: read the letters and the digit as Chinese.
ii. We need to indicate the chunk of source language, and how to render the chunk.
iii. The processors can determine how to deal with it.
iv. Two options:
a) I said 早上好 to him.
b) I said La dolce vita to him
c) 我说hello 给他
13． Session 10: Expression, Speaking style and Focus
1. Emotion & Mood
2. Expressing Pattern
a) Style: news, sports comment, dialog, info
b) Emotion: Positive, neutral, negative, etc
i. +1 for positive, 0 for neutral, -1 for negative
c) Mood: request, acquisition, given, affirmation, apology.
3. Characteristic: Voice tag
ii. Expression of Speech
iii. Hierarchical Prosodic Structure
1. Utterance, bg, pph, pw, syl
2. BG: breath group
i. Humans have a strong ability to reconstruct information.
1. Music with noise
2. Graph with noise
ii. The value of synthesis of focus
iii. Key challenges in synthesis of focus
1. Difficult to locate a focus in a sentence.
iv. <EMPHASIS> in SSML
v. <focus> element
1. The focus element indicates that the contained text is the semantic centre of a sentence and the carrier of its important information.
2. Most foci are realized by stress, but some are realized by pause or intonation.
vi. Differences between focus and emphasis
1. Focus is the concept of semantics and pragmatics
2. Focus always carries the purpose of utterance.
3. Focus: logical structure. Emphasis: rendering.
When we come to semantic level, we can discuss focus again.
b) Speaking Style:
i. iFLYTEK & SinoVoice: may be useful in the future; for now it is difficult to realize.
ii. Nokia: consider it in database design
iii. One Japanese company: support this.
c) In order to standardize, we need companies to have implemented systems. There is less agreement on working on this feature today.
d) Conclusion: What should we do about expressiveness?
ii. (1) Is there enough agreement on “What are they?” (2) Is there enough agreement on “How to render them?”
iii. We revisit this to understand whether it is ready.
1. iFLYTEK: some technology proves that some kind of expressiveness is implementable.
2. JEITA: they implement some kind of expressiveness, but cannot enumerate the kinds of expression. It is difficult to standardize.
3. Other companies: do not know how to implement it.
4. Should it be optional?
15． Other issues
a) Background audio:
i. How to synchronize?
ii. Right now, there is no need to control the audio
i. Current Ruby standard is too complex.
i. Mora per second (Japanese)
i. If the speech synthesizer does not find a suitable entry in the original language’s lexicon, it may check another language’s lexicon (e.g. English mixed with German).
i. Indicate the topic area: for text normalization, data from other domains can then be excluded.
f) Syllable markup
i. Does SSML need to be used for annotation?
ii. We want to keep the simple things easy and the hard things possible.
i. Synchronize with the Multimodal working group, e.g. for talking heads.
16． What are the next steps?
a) Voice browsing working group: A small sub-group
i. Find out requirements and priority
ii. To specify and to develop the specification.
iii. Implementation report:
1. Everything in the standard needs to be proved implementable.
b) Interested People:
c) Conduct workshop(s) for other languages
i. Semitic languages
ii. Slavic languages
i. Summarize this meeting.
1. The Voice Browser working group may split off an SSML working group.
List the items discussed in the meeting, then send everyone a survey email and let people prioritize.