Internationalizing Speech Synthesis

http://www.w3.org/2006/Talks/0525-ka-voice/

Kazuyuki Ashimura <ashimura@w3.org>
25 May, 2006

Voice Browser: integration of Web & Speech Tech

Applying Web technology to enable users to access services from their telephone.

Voice Browser

Accessing public information: weather, cinema schedule, etc.
Accessing private services: voicemail, bank, airline reservation, etc.

Framework for Voice Browser

DFP: New Speech Interface Framework

A modular approach facilitates application development, maintenance, debugging and reuse.

Voice Browser architecture and standards

Speech Interface Framework specifications

Presentation layer:

Flow layer:

Data layer:

Currently included in SCXML

The Voice Browser Working Group

The biggest group in W3C (88 participants from 36 organisations)
Specifications in progress:
VoiceXML 2.1, VoiceXML 3.0, SISR, PLS, CCXML, SCXML, SSML
Discussion for generating specs:
- Telephone Conferences (Every week)
- Face-to-Face meeting (Once every 3 months)
- Mailing List (Public, WG Internal)
- IRC connection

Why Internationalizing SSML?

Global users of the Web

The Web is not only for native English speakers but for everyone in the world.
- We should consider international services that connect many countries.
- So SSML should provide features for the many languages of the world.

Extension of SSML ability

Enhancements for non-English languages make SSML more useful in current and emerging markets (e.g. China, Korea, Japan).
- Better pronunciation and prosody are essential for richer synthesis.
- Non-English speech synthesis technology also offers many useful hints.

Not only "What" but also "How to say"

Both are important for disambiguating multiple pronunciations.

SSML 1.0 vocabulary provides various ways to eliminate pronunciation ambiguities.

Word-level, phoneme-level and waveform-level controls (<phoneme>, <say-as>, ...)
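
As a rough illustration of these controls, the sketch below builds a small SSML 1.0 document using <say-as> and <phoneme> and checks that it is well-formed XML. The sentences, the IPA string and the interpret-as value "date" are assumptions made for this example (SSML 1.0 itself does not define the interpret-as values), and local_names is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Namespace of SSML 1.0 documents.
SSML_NS = "http://www.w3.org/2001/10/synthesis"

# Illustrative document: the text, the IPA transcription and the
# interpret-as value are assumptions, not taken from the slides.
ssml_controls = """<?xml version="1.0" encoding="UTF-8"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en">
  <s>
    Your flight leaves on
    <say-as interpret-as="date">2006-05-25</say-as>.
  </s>
  <s>
    I say <phoneme alphabet="ipa" ph="təˈmeɪtoʊ">tomato</phoneme>.
  </s>
</speak>
"""

# Parsing raises xml.etree.ElementTree.ParseError if the markup is malformed.
root = ET.fromstring(ssml_controls.encode("utf-8"))


def local_names(element):
    """Return the tag names of all elements, without the namespace part."""
    return [el.tag.split("}", 1)[-1] for el in element.iter()]


phoneme = root.find(f".//{{{SSML_NS}}}phoneme")
print(local_names(root))
print(phoneme.get("alphabet"), phoneme.get("ph"), phoneme.text)
```

Here <say-as> tells the synthesizer how to read a token (word level), while <phoneme> overrides the pronunciation itself (phoneme level).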

However, there are still many problems remaining...

A single character sequence can be pronounced in several different ways.
- Text input provides only "What to say" information.
- Additional information such as prosody is very important as "How to say" information.

Example 1: Rhythm & Pause in English

I'd like to go to Menzies Belford Hotel.

Which is better?
Can you understand me, and take me to the hotel?

Synthesized speech with no pause specification

vs.

Synthesized speech with pause specification
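
The markup behind the two audio samples is not shown here, so the following is only a sketch of how a pause could be specified with the SSML 1.0 <break> element. The pause position (before the hotel name) and the 300 ms duration are assumptions, and speak and with_pause are hypothetical helpers.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Namespace of SSML 1.0 documents.
SSML_NS = "http://www.w3.org/2001/10/synthesis"


def speak(body: str) -> str:
    """Wrap already-escaped SSML content in a <speak> root element."""
    return f'<speak version="1.0" xmlns="{SSML_NS}" xml:lang="en">{body}</speak>'


def with_pause(lead: str, name: str, pause_ms: int) -> str:
    """Insert an explicit pause between the lead-in words and a proper name."""
    return speak(f'{escape(lead)} <break time="{pause_ms}ms"/> {escape(name)}.')


# No pause specification: the synthesizer decides the phrasing on its own.
plain = speak(escape("I'd like to go to Menzies Belford Hotel."))

# Pause specification: position and duration are assumed for illustration.
paused = with_pause("I'd like to go to", "Menzies Belford Hotel", 300)

# Both documents must parse as well-formed XML.
plain_root = ET.fromstring(plain)
paused_root = ET.fromstring(paused)
breaks = paused_root.findall(f"{{{SSML_NS}}}break")
print(paused)
```

A short pause before an unfamiliar proper name gives the listener a cue that a new unit starts, which plain text alone cannot express.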

Example 2: Accent variations in Japanese

A certain character sequence can have several different meanings with different pitch accents.

Note: "'" means that there is an accent nucleus (= perceived pitch falling).

Example 3: Speech style variations in Japanese

A certain character sequence can even have opposite meanings with different combinations of duration and intonation.

Person A:	"お昼ごはんを食べましょう" (o-hiru gohan wo tabe masho) Shall we go to lunch?
Person B:	"うん" (un) Yes or No...

Recent activities on I18N of SSML

Workshops

First Workshop in Beijing: 2-3 Nov. 2005

In order to identify and prioritize issues to improve the use of SSML for rendering non-English languages
Speech Experts from Asian and Western countries participated
Extensions and additions to SSML were identified and prioritized
Agenda & Minutes: http://www.w3.org/2005/08/SSML/ssml-workshop-agenda.html

Second Workshop in Greece: 30-31 May (Next Week!)

In order to solicit additional suggestions to increase the use of SSML for other non-English languages.
Participants are speech experts on various languages such as Indian, Syrian, Arabic, Hungarian, Polish, Finnish, Slovenian, etc.
Agenda: Second Workshop in Greece: 30-31 May (Next Week!)

Issues clarified in the Beijing Workshop

High Priority: Common problems in many languages

Word boundaries
Denote languages and dialects
Phonetic alphabets

Middle Priority: Language specific

Chinese names
Special words (Name, Number, ...)
Tones and tone sandhi
Sentence structure
Text with multiple languages
mora/sec
Ruby

Lower Priority: Difficult to standardize...

Diacritics
Expand POS
Expressive elements
Background sound
Syllable markup

Current status & Plan

Basic policy to generate specifications

High and Middle Priority issues:
→ will be included in SSML 1.1
Lower Priority issues:
→ will be included only if some WG participants support and implement them.

Discussion within SSML Subgroup

Discussion on SSML 1.1 is ongoing in the SSML subgroup of VBWG
Many participants in the subgroup are from China
Kickoff meeting held: 18-19 April

Conclusion

The work of the W3C Voice Browser Working Group
- What a Voice Browser is
- Speech Interface Framework Specification
Internationalizing SSML
- Problem of Pronunciation Disambiguation
- Issues and requirements identified in the Workshop
Recent activity of the Voice Browser Working Group
- SSML Subgroup in VBWG
- Next Workshop in Greece: 30-31 May, 2006

More information available

These slides:
- http://www.w3.org/2006/Talks/0525-ka-voice/
SSML Workshops:
- the SSML Workshop in Beijing, November 2005
- the Second SSML Workshop in Greece, May 2006
Voice Browser Working Group:
W3C:
- W3C Mission
- Membership Benefits

Thanks!

Appendices

available below...

Controls for prosodic information

To solve the problem of pronunciation ambiguities, additional specifications must be added to SSML.

In particular, controls for prosodic information are essential for Asian tonal languages.
Such controls can be specified for each step of the TTS process to control each database and/or model (e.g. model selection, parameters for a model).

Category of prosodic controls

According to Fujisaki, prosodic information is classified into three categories.
Therefore we should consider these three categories when we discuss prosodic controls.

Linguistic Information:

Symbolic information represented by a set of discrete symbols and rules for their combination.
It is either represented explicitly by the written language or can be easily and uniquely inferred from context.
It is discrete and categorical, for example, character sequences, parts of speech, accent types, etc.

Paralinguistic Information:

Information not inferable from the written counterpart but deliberately added by the speaker to modify or supplement the linguistic information.
It can be both discrete and continuous, for example, duration and speech rate, fundamental frequency transition, spectrum transition, etc.

Nonlinguistic Information:

Information concerning factors such as age, gender, idiosyncrasy, and the physical and emotional states of the speaker.
It is not directly related to linguistic or paralinguistic information, and is not generally under the control of the speaker.

Possible prosodic controls

There are various prosodic controls which are useful for rendering non-English languages.
Some of them are already included in SSML 1.0; others should be added.
Additional topics and extensions to current SSML will be proposed in the Workshops.

Plain items:	Examples of potential controls borrowed from Fujisaki's definition
Items in angle brackets:	Elements for prosodic controls in SSML 1.0

Category of prosody	Input level: Text Analysis	Input level: Prosody Analysis	Input level: Waveform Production
Linguistic Information	character sequences, part of speech, accent types; <p>, <s>, <say-as>, <sub>, <lexicon>, <phoneme>	?	?
Paralinguistic Information	?	duration and speech rate, fundamental frequency transition, spectrum transition; <prosody>, <emphasis>, <break>	<prosody> (partly)
Nonlinguistic Information	?	?	age, gender, idiosyncrasy, physical and emotional states of the speaker; <voice>, <audio>
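
The table can also be read as a lookup from (category, input level) to SSML 1.0 elements. The sketch below transcribes only the element placement from the table; the names SSML10_CONTROLS, uncovered_cells and where_is are hypothetical, and the lowercase keys are a convention chosen for this example.

```python
# Rows and columns of the table above.
CATEGORIES = ["linguistic", "paralinguistic", "nonlinguistic"]
STAGES = ["text analysis", "prosody analysis", "waveform production"]

# SSML 1.0 elements per table cell, transcribed from the table.
SSML10_CONTROLS = {
    ("linguistic", "text analysis"): ["p", "s", "say-as", "sub", "lexicon", "phoneme"],
    ("paralinguistic", "prosody analysis"): ["prosody", "emphasis", "break"],
    ("paralinguistic", "waveform production"): ["prosody"],  # marked "partly" in the table
    ("nonlinguistic", "waveform production"): ["voice", "audio"],
}


def uncovered_cells():
    """Return the cells marked "?" in the table, i.e. with no SSML 1.0 element."""
    return [
        (category, stage)
        for category in CATEGORIES
        for stage in STAGES
        if (category, stage) not in SSML10_CONTROLS
    ]


def where_is(element):
    """Return every (category, stage) cell in which an element appears."""
    return [cell for cell, elements in SSML10_CONTROLS.items() if element in elements]


print(uncovered_cells())
print(where_is("prosody"))
```

Listing the uncovered cells is one simple way to see where additional controls might be discussed.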