Internationalizing Speech Synthesis

http://www.w3.org/2006/Talks/0525-ka-voice/

Kazuyuki Ashimura <ashimura@w3.org>
25 May, 2006

Edinburgh Morning

Contents

Voice Browser: integration of Web & Speech Tech

Applying Web technology to enable users to access services from their telephone.

Voice Browser

Framework for Voice Browser

DFP: New Speech Interface Framework

A modular approach facilitates application development, maintenance, debugging, and reuse.

Voice Browser architecture and standards

Speech Interface Framework specifications

Presentation layer:

Flow layer:

Data layer:

The Voice Browser Working Group

Why Internationalize SSML?

Global users of the Web

Extending the capabilities of SSML

Not only "what to say" but also "how to say it" is important for disambiguating multiple pronunciations.

 

SSML 1.0 vocabulary provides various ways to eliminate pronunciation ambiguities.
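
As a rough illustration of the kind of markup meant here, the Python sketch below assembles a small SSML 1.0 document that uses <sub>, <say-as> and <phoneme>. The sample sentence, the IPA string, and the helper name build_ssml are assumptions made for this example and do not come from the talk. The interpret-as and format values are also only illustrative, since SSML 1.0 itself does not fix a set of values for them.

```python
import xml.etree.ElementTree as ET

SSML_NS = "http://www.w3.org/2001/10/synthesis"
XML_NS = "http://www.w3.org/XML/1998/namespace"


def build_ssml() -> str:
    """Return a small SSML 1.0 document as a string (hypothetical helper)."""
    # Serialize SSML elements in the default namespace (no prefix).
    ET.register_namespace("", SSML_NS)

    def tag(name: str) -> str:
        return f"{{{SSML_NS}}}{name}"

    speak = ET.Element(tag("speak"), {"version": "1.0", f"{{{XML_NS}}}lang": "en-GB"})
    s = ET.SubElement(speak, tag("s"))
    s.text = "The "

    # <sub>: speak an expansion instead of the written abbreviation.
    sub = ET.SubElement(s, tag("sub"), {"alias": "World Wide Web Consortium"})
    sub.text = "W3C"
    sub.tail = " talk in "

    # <phoneme>: give an explicit pronunciation (illustrative IPA string).
    phoneme = ET.SubElement(s, tag("phoneme"), {"alphabet": "ipa", "ph": "ˈɛdɪnbərə"})
    phoneme.text = "Edinburgh"
    phoneme.tail = " is on "

    # <say-as>: hint that "25/5" is a day/month date (illustrative values).
    say_as = ET.SubElement(s, tag("say-as"), {"interpret-as": "date", "format": "dm"})
    say_as.text = "25/5"
    say_as.tail = "."

    return ET.tostring(speak, encoding="unicode")


if __name__ == "__main__":
    print(build_ssml())
```

Each element removes one ambiguity about what is to be spoken: the expansion of an abbreviation, the phonemic form of a name, and the reading of a digit string. None of them says anything about pitch accent or speaking style, which is where the examples that follow come in.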

 

However, there are still many problems remaining...

Example 1: Rhythm & Pause in English

I'd like to go to Menzies Belford Hotel.

Which is better?
Can you understand me, and take me to the hotel?
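
One way to experiment with the two readings is to generate the same sentence with <break> elements at different positions. The following is a minimal sketch only: the pause positions, the 300ms duration, and the helper name with_breaks are assumptions for illustration, and the original slide does not say where the pauses in its two versions fall.

```python
from xml.sax.saxutils import escape


def with_breaks(words, break_after, time="300ms"):
    """Wrap words in an SSML <s>, adding <break> after the given word indices."""
    parts = []
    for i, word in enumerate(words):
        parts.append(escape(word))  # escape &, < and > in the text
        if i in break_after:
            parts.append(f'<break time="{time}"/>')
    return "<s>" + " ".join(parts) + "</s>"


words = "I'd like to go to Menzies Belford Hotel.".split()

# Hypothetical variant A: one pause before the hotel name.
variant_a = with_breaks(words, {4})
# Hypothetical variant B: an additional pause inside the name.
variant_b = with_breaks(words, {4, 5})

if __name__ == "__main__":
    print(variant_a)
    print(variant_b)
```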

 

Example 2: Accent variations in Japanese

A certain character sequence can have several different meanings with different pitch accents.

kaki:
  • oyster
  • persimmon

Note: "'" means that there is an accent nucleus (= perceived pitch falling).

Example 3: Speech style variations in Japanese

A certain character sequence can even have opposite meanings with different combinations of duration and intonation.

Person A: "お昼ごはんを食べましょう" (o-hiru gohan wo tabe masho)
Shall we go to lunch?
Person B: "うん" (un)
Yes or No...

un:
  • yes
  • no

Recent activities on I18N of SSML

Workshops

First Workshop in Beijing: 2-3 Nov. 2005

 

Second Workshop in Greece: 30-31 May (Next Week!)

Issues clarified in the Beijing Workshop

High Priority: Common problems in many languages

Middle Priority: Language specific

Lower Priority: Difficult to standardize...

Current status & Plan

Basic policy to generate specifications

Discussion within SSML Subgroup

Conclusion

More information available

Thanks!

Appendices

available below...

Controls for prosodic information

To solve the problem of pronunciation ambiguities, additional specifications must be added to SSML.

TTS Flow

Category of prosodic controls

According to Fujisaki, prosodic information is classified into three categories.
We should therefore consider all three categories when discussing prosodic controls.

Linguistic Information:
  • Symbolic information represented by a set of discrete symbols and rules for their combination.
  • It is either represented explicitly in the written language or can be easily and uniquely inferred from context.
  • It is discrete and categorical, for example, character sequences, parts of speech, accent types, etc.

Paralinguistic Information:
  • Information not inferable from the written counterpart but deliberately added by the speaker to modify or supplement the linguistic information.
  • It can be both discrete and continuous, for example, duration and speech rate, fundamental frequency transition, spectrum transition, etc.

Nonlinguistic Information:
  • Information concerning factors such as the age, gender, idiosyncrasy, and physical and emotional states of the speaker.
  • It is not directly related to linguistic or paralinguistic information, and is generally not under the speaker's control.

Possible prosodic controls


Plain items (black on the original slide): examples of potential controls borrowed from Fujisaki's definition
Items written as tags (red on the original slide): elements for prosodic control in SSML 1.0

Category of prosody, by input level (Text Analysis, Prosody Analysis, Waveform Production):

Linguistic Information
  Text Analysis:
    • character sequences
    • part of speech
    • accent types
    • <p>
    • <s>
    • <say-as>
    • <sub>
    • <lexicon>
    • <phoneme>
  Prosody Analysis: ?
  Waveform Production: ?

Paralinguistic Information
  Text Analysis: ?
  Prosody Analysis:
    • duration and speech rate
    • fundamental frequency transition
    • spectrum transition
    • <prosody>
    • <emphasis>
    • <break>
  Waveform Production:
    • <prosody> (partly)

Nonlinguistic Information
  Text Analysis: ?
  Prosody Analysis: ?
  Waveform Production:
    • age
    • gender
    • idiosyncrasy
    • physical and emotional states of the speaker
    • <voice>
    • <audio>
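
The classification of SSML 1.0 elements in the table above can also be written down as a small lookup, which may be convenient when checking which category a given piece of markup touches. This is a sketch only: the dictionary simply transcribes the table, and the names SSML_ELEMENT_CATEGORY and elements_by_category are hypothetical.

```python
from collections import defaultdict

# Category of each SSML 1.0 element, transcribed from the table above.
# <prosody> is listed under paralinguistic information; the table also
# marks it as acting "partly" at the waveform production level.
SSML_ELEMENT_CATEGORY = {
    "p": "linguistic",
    "s": "linguistic",
    "say-as": "linguistic",
    "sub": "linguistic",
    "lexicon": "linguistic",
    "phoneme": "linguistic",
    "prosody": "paralinguistic",
    "emphasis": "paralinguistic",
    "break": "paralinguistic",
    "voice": "nonlinguistic",
    "audio": "nonlinguistic",
}


def elements_by_category(mapping):
    """Group element names by category, keeping the table's order."""
    grouped = defaultdict(list)
    for element, category in mapping.items():
        grouped[category].append(element)
    return dict(grouped)


if __name__ == "__main__":
    for category, elements in elements_by_category(SSML_ELEMENT_CATEGORY).items():
        print(category, elements)
```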