Internationalizing SSML

Kazuyuki Ashimura,
W3C, Team contact for the Voice Browser Working Group


Why Internationalize SSML?

Global users of the Web

Extending the capabilities of SSML

Problem to be solved: Pronunciation ambiguity

Example of pronunciation ambiguity in Japanese (1)

The same character sequence can have several different meanings, distinguished by pitch accent.

[Figure: two readings of "kaki" with different pitch accents, one of them meaning "oyster"]

Note: "'" marks an accent nucleus (= a perceived fall in pitch).
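The "'" notation can be sketched in a few lines of Python. This is only an illustration: the helper `mark_accent`, the mora split, and the accent positions (Tokyo-style accent assumed, with "persimmon" as the unaccented reading) are assumptions added here, not part of the slides.

```python
def mark_accent(morae, nucleus):
    """Join morae, writing "'" after the mora that carries the accent nucleus.

    nucleus is a 1-based mora index; 0 means there is no accent nucleus.
    """
    if not 0 <= nucleus <= len(morae):
        raise ValueError("nucleus must be between 0 and the number of morae")
    if nucleus == 0:
        return "".join(morae)
    return "".join(morae[:nucleus]) + "'" + "".join(morae[nucleus:])


# Hypothetical lexicon entries: (gloss, accent nucleus position).
# The positions assume Tokyo-style pitch accent and are illustrative only.
KAKI_READINGS = [("persimmon", 0), ("oyster", 1)]

for gloss, nucleus in KAKI_READINGS:
    print(gloss, mark_accent(["ka", "ki"], nucleus))
```

The two readings share the same character sequence and differ only in the nucleus position, which plain text does not carry.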

Example of pronunciation ambiguity in Japanese (2)

Sometimes the same character sequence can even have opposite meanings, depending on the combination of duration and intonation.

[Figure: readings of "un" with different duration and intonation, one of them meaning "yes"]
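As a rough sketch of what "duration and intonation" look like in markup, the Python below wraps the same text in an SSML 1.0 `<prosody>` element with two different `duration` and `contour` settings. The helper `prosody_ssml`, the attribute values, and which rendition corresponds to "yes" or "no" are all assumptions for illustration; they are not taken from the slides or from any measurement.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr

SSML_NS = "http://www.w3.org/2001/10/synthesis"


def prosody_ssml(text, duration, contour, lang="ja"):
    """Return a minimal SSML 1.0 document with one <prosody> element."""
    return (
        f'<speak version="1.0" xmlns="{SSML_NS}" xml:lang={quoteattr(lang)}>'
        f"<prosody duration={quoteattr(duration)} contour={quoteattr(contour)}>"
        f"{escape(text)}</prosody></speak>"
    )


# Illustrative values only: a short falling rendition and a longer fall-rise one.
UN_SHORT_FALL = prosody_ssml("うん", "200ms", "(0%,+10%) (100%,-20%)")
UN_LONG_FALL_RISE = prosody_ssml("うん", "500ms", "(0%,+10%) (50%,-20%) (100%,+10%)")

for document in (UN_SHORT_FALL, UN_LONG_FALL_RISE):
    ET.fromstring(document)  # raises ParseError if the markup is not well-formed
    print(document)
```

Both documents contain identical text; only the prosody attributes differ, so the intended meaning has to be supplied by the author of the markup.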

Controls for prosodic information

To resolve these pronunciation ambiguities, additional specifications must be added to SSML.

Category of prosodic controls

According to Fujisaki, prosodic information falls into three categories.
We should therefore consider all three categories when discussing prosodic controls.

Linguistic Information
  • Symbolic information represented by a set of discrete symbols and rules for their combination.
  • It is either represented explicitly in the written language or can be easily and uniquely inferred from context.
  • It is discrete and categorical, for example, character sequences, parts of speech, accent types, etc.

Paralinguistic Information
  • Information not inferable from the written counterpart but deliberately added by the speaker to modify or supplement the linguistic information.
  • It can be both discrete and continuous, for example, duration and speech rate, fundamental frequency transition, spectrum transition, etc.

Nonlinguistic Information
  • Information concerning factors such as the age, gender, idiosyncrasy, and physical and emotional states of the speaker.
  • It is directly related to neither linguistic nor paralinguistic information, and is generally not under the control of the speaker.

Possible prosodic controls

Plain items (black on the slide): examples of potential controls, borrowed from Fujisaki's definition
Items in angle brackets (red on the slide): elements for prosodic control in SSML 1.0

Columns: Category of prosody | Input Level: Text Analysis | Prosody Analysis | Waveform Production

Linguistic Information
  • character sequences
  • part of speech
  • accent types
  • <p>
  • <s>
  • <say-as>
  • <sub>
  • <lexicon>
  • <phoneme>
? ?

Paralinguistic Information
  • duration and speech rate
  • fundamental frequency transition
  • spectrum transition
  • <prosody>
  • <emphasis>
  • <break>
  • <prosody> (partially)
? ?

Nonlinguistic Information
  • age
  • gender
  • idiosyncrasy
  • physical and emotional states of the speaker
  • <voice>
  • <audio>
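The grouping of SSML 1.0 elements under the three categories can be sketched as a lookup table. The Python below is a minimal illustration: the dictionary `CATEGORY_OF_ELEMENT`, the function `count_categories`, and the sample document are hypothetical, and the mapping ignores the input-level columns and the "(partially)" qualification on `<prosody>`.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Element-to-category grouping as listed in the table; the dictionary itself
# is an illustrative construct, not something defined by SSML.
CATEGORY_OF_ELEMENT = {
    "p": "linguistic", "s": "linguistic", "say-as": "linguistic",
    "sub": "linguistic", "lexicon": "linguistic", "phoneme": "linguistic",
    "prosody": "paralinguistic", "emphasis": "paralinguistic",
    "break": "paralinguistic",
    "voice": "nonlinguistic", "audio": "nonlinguistic",
}


def count_categories(ssml_text):
    """Count how many elements of each prosodic category a document uses."""
    counts = Counter()
    for element in ET.fromstring(ssml_text).iter():
        local_name = element.tag.rsplit("}", 1)[-1]  # drop "{namespace}" if present
        category = CATEGORY_OF_ELEMENT.get(local_name)
        if category is not None:
            counts[category] += 1
    return counts


# Hypothetical sample document.
SAMPLE = (
    '<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en">'
    "<p><s>Hello <emphasis>world</emphasis>.</s></p>"
    '<voice gender="female">Goodbye<break/></voice>'
    "</speak>"
)

print(count_categories(SAMPLE))
```

A tally like this shows at a glance which of the three categories a given document actually controls through markup.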

Let's get started

Goals & Scope of the workshop


  1. Diacritics for auto-completion
  2. Representing special word classes
  3. Representing word boundaries
  4. Denoting language and character sets
  5. Tones
  6. Sentence structure
  7. Words with multiple pronunciations and meanings
  8. Text with multiple languages
  9. Expression, speaking style, and focus
  10. Other extensions and/or additions to SSML