Minutes of the Second Workshop on Internationalizing SSML

Foundation for Research and Technology - Hellas (FORTH) in Heraklion, Crete,
site of the W3C Office in Greece

30-31 May 2006

[Photo from the second SSML Workshop]

Each session included the presentation of one or two papers, followed by a discussion of at least one item presented in the papers. Some discussions refer to items from several previously presented papers.

 

Attendees

Jerneja Zganec Gros (Alpineon, Slovenia)
Geza Nemeth (BME TMIT)
Geza Kiss (BME TMIT)
Nixon Patel (Bhrigus Inc.)
Raghunath K. Joshi (Centre for Development of Advanced Computing/C-DAC Mumbai)
Chris Vosnidis (Dialogos Speech Communications S.A.)
Bonardo Davide (Loquendo)
Kimmo Parssinen (Audio Applications, Nokia Specific Symbian SW, Technology Platforms)
Ksenia Shalonova (OutsideEcho)
Oumayma Dakkak (HIAST)
Przemyslaw Zdroik (France Telecom R&D Poland)
Paolo Baggia (Loquendo)
Max Froumentin (W3C)
Kazuyuki Ashimura (W3C)
Richard Ishida (W3C)
Dan Burnett (W3C Invited Expert)

 

Tuesday 30 May, 8:30-18:00

Session 1: Introductory

Moderator:
Kazuyuki Ashimura
Scribe:
Max Froumentin
Welcome and meeting logistics — FORTH –[Slides]

none

Workshop expectations — Kazuyuki Ashimura –[Slides]

none

Introduction to W3C and the Voice Browser Working Group — Max Froumentin –[Slides]

none

Internationalization of SSML — Dan Burnett [Slides]

none

PLS for SSML — Paolo Baggia –[Slides]

Q: if a word is both a homophone and a homograph, what is the hierarchy? In Devanagari the word "kurl" means "hand" and also "do". The spelling is the same, the meaning different. Which would you take care of first?

Paolo: it doesn't matter. If the pronunciation is the same, then the TTS will say it right. The distinction is at another level, that of the Semantic Web (outside the scope of the workshop).

Nixon: but the answer is right there.

Paolo: nothing prevents a lexicon with 2 entries: same graphemes, same pronunciation but different "role". We'll talk about this problem in the relevant sessions.
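
For illustration, a minimal sketch of such a lexicon in the draft PLS syntax (the role values and the English example word are illustrative assumptions, not from the discussion):

  <lexicon version="1.0"
           xmlns="http://www.w3.org/2005/01/pronunciation-lexicon"
           alphabet="ipa" xml:lang="en">
    <!-- two entries: same grapheme, same pronunciation, different role -->
    <lexeme role="noun">
      <grapheme>visit</grapheme>
      <phoneme>ˈvɪzɪt</phoneme>
    </lexeme>
    <lexeme role="verb">
      <grapheme>visit</grapheme>
      <phoneme>ˈvɪzɪt</phoneme>
    </lexeme>
  </lexicon>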


Q: Why not use SAMPA as the phonetic alphabet?

Paolo: IPA is something you can reference. SAMPA has many variants, and some companies even have their own. IPA is difficult to write, but at least it's one alphabet that tries to cover all sounds.

Updates to RFC 3066 — Richard Ishida –[Slides]

Max: what is the script tag for Japanese?

Richard: not sure how that works.


Anna: for Greek, would a tag for Ancient Greek be interesting?

Richard: it would indeed. But for Modern Greek it would not be needed.


Q: people sometimes tend to abuse that, e.g. in Greek, they write in Latin script in SMS, and don't use diacritics.

A: Poles do that too; in the case of Polish, it's not "real" Polish


Kazuyuki: what is the difference between Scottish English and Irish or Welsh?

Richard: dialects


Nixon: we need to come up with a breakdown of dialects; how do we register them?

Richard: register them with IANA


Dan: IANA allows you to create registries and specify how values are added (and who's responsible).
  E.g. top-level domain names, MIME types, character sets.

(BREAK)

Session 2: Languages / Dialects

Moderator:
Richard Ishida
Scribe:
Dan Burnett / Paolo Baggia
Ksenia Shalonova: Position paper for SSML workshop in Crete –[Slides, wave1, wave2, wave3]

Topics include: tones, dialects & styles, schwa deletion

Perspective of the Local Language Speech Technology Initiative
- provide tools for languages in developing countries
- kiSwahili (across East Africa, ~20 million speakers)
- isiZulu   (South Africa - the government sponsors African languages)
- Hindi

- in developing countries there is little access to PCs; information comes
  from (cheap) mobile phones, as does Internet access
- Many kiosks
- Huge number of illiterate people

- There are business opportunities in these countries
  - kiSwahili - services in Kenya and Tanzania
  - isiZulu   - kiosks in South Africa
  - Hindi     - info to book railway tickets

Decomposition of words into constituents
Nixon Patel: SSML Extensions for Text to Speech systems in Indian Languages –[Slides]

Topics include: syllables, loan words, <dialect>

- Nature of Indian language scripts
- Issues across TTS rendering in all languages 

Speech Language Technology Lab @ Bhrigus
- Playing leadership role
- 10 members and advisors
  3 PhDs + 4 Masters
- initiating SSML and VXML chapters in India

Nature of Indian languages
- basic units of the writing system are aksharas
- aksharas are syllabic in nature;
  forms are V, CV, CCV, CCCV
  - they always end with a vowel (or nasalized vowel)
    in written form
- ~1652 dialects/native languages
- 22 languages officially recognized

Convergence of IL scripts
- Aksharas are syllabic in nature
- Common phonetic base
  - a common set of speech sounds is shared across all
    languages
  - Fairly good

- Each Indian language (IL) has its own script
- All share a common phonetic base
- Non-tonal

How to represent Indian language scripts
- Unicode
  - useful for rendering Indian scripts
  - not suitable for keying
  - not suitable for building modules such as text normalization
- Itrans-3 / OM - a transliteration scheme by IISc Bangalore,
  India, and Carnegie Mellon University
  - useful for keying in and storing the scripts of Indian
    languages using QWERTY keyboards
  - useful for processing and writing modules/rules for
    letter-to-sound, text normalization, etc.

Issues in TTS rendering in IL
- TTS should be able to pronounce words as aksharas
- Languages have heavy influence

- <phoneme alphabet="itrans-3" ph="  ">
  but
  <syllable alphabet="itrans-3" syl="naa too">...

- Motivation for Loan Word <alien>
  - BANK has to be pronounced as /B/ /AE/ /N/ /K/
  - /AE/ phoneme

Dialect Element
- To load language rules

Conclusions:
- proposed: <syllable>, <alien>, <dialect>
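
A sketch of how the three proposed elements might compose (none of them is part of SSML 1.0; the attribute names follow the slides, and the content placeholders are illustrative):

  <speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
         xml:lang="hi-IN">
    <dialect name="...">                        <!-- proposed: loads language/dialect rules -->
      <syllable alphabet="itrans-3" syl="naa too">...</syllable>
      <alien xml:lang="en">BANK</alien>         <!-- proposed: loan word, /B/ /AE/ /N/ /K/ -->
    </dialect>
  </speak>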
Discussion: How should dialects be supported? What are the shortcomings of RFC 3066? –[RFC 3066bis article]
(Goal: understand the problems.)

Dan: This came up in another workshop
- Distinguish written and spoken language
- there is a new version, RFC 3066bis
- is that sufficient, or are separate markings needed?

Joshi: In Indian languages there are 16 standard
  languages + 3 more
- they are spoken, not written

Nixon: there is a trade-off from an active implementation
  perspective; it is very inefficient to load resources

Oumayma: Similar problem in Arabic
- a text-to-phoneme component
- there are too many syllables, so diphones are used to save space
  
PB: 
- try to clarify different ways in SSML today:
  - phoneme
  - lexicon
  with possible extensions to deal with ambiguities:
  - token role -> lexicon
  - token xml:lang
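
For reference, the two mechanisms SSML 1.0 offers today (the lexicon URI is illustrative):

  <speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
         xml:lang="en-US">
    <lexicon uri="http://www.example.com/lexicon.pls"/>
    You say <phoneme alphabet="ipa" ph="təˈmɑːtoʊ">tomato</phoneme>.
  </speak>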

GezaN: 
- not for engine developers, but for application developers
- dialect or language are the same; if different, you need
  a different engine
- proposal from Univ. Hung

Chris: 
- we have this discussion because ILs share the same phonemes
- if the difference is big, create a new engine

Dan: is xml:lang enough?

Nixon: Yes

Richard:
- xml:lang is a text processing directive;
  it describes the content of the element
- there is a need for other directives for other activities,
  like loading lexicons, changing voices, etc.

Dan: 
- to kick off the discussion:
  xml:lang is serving two purposes;
  add a new attribute like "speak-as" to specify the
  target language
- SSML 1.1 will discuss the new xml:lang to understand
  whether a single attribute is enough or a second one is
  better
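
A sketch of Dan's suggestion (the "speak-as" attribute is purely hypothetical and would not validate against SSML 1.0):

  <!-- content is written in Italian, but should be rendered by the en-US engine -->
  <s xml:lang="it" speak-as="en-US">La vita e bella</s>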

 

(LUNCH)

Session 3: Syllables / Tokens

Moderator:
Paolo Baggia
Scribe:
Max Froumentin
Geza Nemeth and Geza Kiss: Proposals for Extending the Speech Synthesis Markup Language (SSML) 1.0 from the Point of View of Hungarian TTS Developers –[Slides]

Topics include: speaking style, syllable structure, phoneme language, parts of speech, phonetic alphabet for Slovenian, foreign text

Davide Bonardo: SSML Extensions for Multilanguage usage –[Slides, wave1, wave2, wave3, wave4, wave5]

Topics include: interpret-as, <token>

Discussion: How should syllable structure / token be represented in SSML?
Paolo: Loquendo prefers "token", a term not too linguistically marked.

Dan: I agree. We've had interesting related discussions regarding Chinese. I've heard
  3 issues in all: 
  - unit demarcation: the unit might be word, token, morpheme, syllable, phoneme, mora.
  - unit activities: change pronunciation, emphasis, timing, stress, volume (the phoneme element today)
  - link to lexicon: token is used so that there's a tie to the lexicon. SRGS has tokens, which
    are tied to lexicons. There has to be a linkage that's clearly understood between SSML and PLS

Nixon: SISR?

Dan: SISR is a separate processing step. The first is to match the sounds to words, the second
       is to map words to meaning ("coke" and "coca-cola" both map to coca-cola). SISR is used to 
       set up this mapping.

Paolo: if we know it's an SMS, that helps, because there can be many acronyms ('mtfbwy'). ASR
       shares similar problems. But in SRGS there is already <token> for things similar to this.
       E.g. "New York" has coarticulation: better to keep it together, so you mark it up as 
       <token>New York</token> in SRGS.

Geza Kiss: Chinese word boundary detection is important.

Dan: yes, we talked about it in the group. The Chinese don't care about the element name
     (word, token), they just need it.

Chris: The question is, do we delegate a lot of responsibility to the TTS?
       Can we assume the TTS can handle SMS script? Also, it's very useful
       to combine POS with inflected languages. This additional information
       is useful. Finally, about emotion, and about Italian within English...

Paolo: SMS and emotions are for later discussion. Right now we want to
       talk about word segmentation. For languages, we have to draw a
       line, otherwise we're going to redefine the whole of SSML. What
       we offer are a few ways of correcting the engine if it doesn't do
       well. SSML 2 may go beyond that.

Jerneja: with respect to token/word, it's very important to have PoS for highly
      inflected languages.

Raghunath: phoneme/morpheme/word/sentence. The word has to be split
    in some way. So is token similar to morpheme?

Paolo: yes, similar to morpheme and other things. A phoneme is a piece of something, but
       this something differs according to the language. 

Raghunath: but a morpheme has semantic bearing

Paolo: either we go with "morpheme", which has a precise technical meaning, or we go for something
       practical: an element with no precise definition but which works for splitting words.

Geza Nemeth: add an attribute to <token>. In Chinese, you have to
     differentiate prosodic and pronunciation units.

Ksenia: can you add tones for African languages to token?

Paolo: yes, that was a proposal from the Chinese participants. Token
       is decomposition with characterisation.

Dan: we may put features in SSML which we leave half-specified and
     flexible, but we do have to know the linkage with the lexicon,
     even if the lexicon entries are not well-defined either. What
     turned out to be most important for the Chinese was that using a
     syllable-based alphabet did most of the work. Here the concern is
     different: what do we want to do with the segmentation offered?

Geza Kiss: a token cannot mean a syllable. SSML says so.

Dan: yes, if you have a one-syllable word, then yes. Otherwise no. To
     me token means word, except I don't want to say "word", because of
     some languages.

Paolo: you're saying: we're missing <token>, and <phoneme> could change semantics.

Oumayma: you map the phonemes from the mother language: do you have look-up
tables to map phonemes? Have you done statistical analysis?

Davide: each language has a table of phonemes, and we have a patented
        algorithm that does the mapping, based on linguistic
        classification. We do it for all 18 languages we support.

Kazuyuki: about Japanese tokenization processing. A Japanese word is a
          morpheme: not only grapheme, but PoS. The second problem is that
          there are many compound words in Japanese [shows example on
          whiteboard: /ura niwa / niwa / niwa / niwa tori / ga
          iru/]. That's a problem for Japanese TTS, which has to do
          analysis. Japanese lexicons have both separate and compound
          entries.

Paolo: so it is useful to tokenize? 

Kazuyuki: yes, for separate and compound entries

Paolo: use recursive tokens?
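
A sketch of what recursive tokens might look like for Kazuyuki's whiteboard example (<token> itself is only a proposal at this point, and the glosses are approximate):

  <token>
    <token>ura</token><token>niwa</token>    <!-- compound: "backyard" -->
  </token>
  <token>niwa</token>
  <token>niwa</token>
  <token>
    <token>niwa</token><token>tori</token>   <!-- compound: "niwatori", chicken -->
  </token>
  <token>ga iru</token>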

Geza Nemeth: not compatible with SSML 1.0

Dan: in SSML 1.0, all the examples were ones where you tried to
     override the TTS.  So is this particular thing sufficient to fix
     the processors? What's the minimal thing we need to do? In this
     case, would one level of tokenization be sufficient?

Raghunath: a quick comment on the whiteboard. In Sanskrit there are many
           compound words. Gives an example ("Arun-Udei"?) with
           coarticulation.

Dan: does that example exist in any language? [says "Idnstrd" for "I don't
     understand"].  A TTS may be smart, but may not be able to tokenize
     everything.

Ksenia: in African tonal languages, you may need several levels of tokens

Geza Nemeth: 1. upward compatibility with SSML 1.0? 2. I think there should
      be a subdivision of token, otherwise it's a mess. 3. SSML could
      be used after semantic analysis of the text, to be passed to the
      synthesizer. 

Dan: there was an example given in Chinese: a poem which, according to
     where the boundary was, meant one thing or the opposite. 

Max: so the ambiguity exists for humans too. Should SSML do better? Guess?

Dan: the engine that generates the SSML necessarily adds semantic
     information in any case.

Paolo: there is the problem of backwards compatibility and scalability
       for future versions. You'll want more than <token>, so you will
       add lots of new elements.

Przemyslaw: in Arabic TTS, tokenizing text is also important, then
      vocalizing. One level of tokenizing is enough.

Oumayma: Arabic is a syllabic language, so do you rely on this fact?

Przemyslaw: in order to vocalize/vowelize, we need token markup. It's easier.

Oumayma: I don't agree. Is he working on the signal or on the text? On
         the signal, I disagree; on text, a whole word can be a
         collection of phonemes, so it's easier to tokenize.

MAJOR POINTS:

  - unit demarcation (approximately word-level and below, i.e. not
  subphrases of a sentence): unit might be word, token, morpheme,
  syllable, phonemes, mora. We agree it's important and <token> might
  be a good short-term solution.

  - unit-level processing: change pronunciation, emphasis, timing,
    stress, volume (the phoneme element today)

  - link to lexicon: token is used so that there's a tie to
    lexicon. SRGS has tokens, which are tied to lexicons. There has to
    be linkage that's clearly understood between SSML and PLS. 
    If the token is at the word level then it can be marked as another language.

  - token and xml:lang 
    Dan: Do you need lower-than-word-level language identification?

Geza Nemeth: in German or agglutinative languages: compound words with 
    different pronunciation requirements.

Richard: again, it's the question of what a word is. "teburu-ni-wa"
    is one word? It has English and Japanese.

 

(BREAK)

Session 4: IPA / Phonetic Alphabets

Moderator:
Kazuyuki Ashimura
Scribe:
Paolo Baggia
Raghunath K. Joshi: The Phonemic model from India for Bi-modal Applications –[Slides]

Topics include: IPA and phonetic model

Model for Multilingual communication (textual/verbal)

Deshnanagari - a common script for all Indian languages
Multilingual happenings - social events in Mumbai

Non semantic - Sound Poems

Collaborative research on notation system for Indian music
with Dr. Ashok Ranade

Manual typographic activity for many years
(syllabic breaks and meaning breaks)

Indian Oral tradition had a long history

Veda families
from Oral → Text → Phonetics → Grammar

Definition of Phoneme (Varnas)
Vowel - Consonants

Definition of Phonemes (2000 B.C.)

Formation of articulate sounds and mode

Speech Related Issues

Indus Signs

Brahmi script

Consonant sound + Vowel sound renderings
in different scripts

+ accent marks rendering of Vedic Sanskrit

Concept of InPho
- Correlation with IPA
- Proposal of phonemic codes

Range of IPA

InPho Issues
IPA Issues

Position Statement 

Concrete Text  ↔ Stylistic speech
Formatted Text ↔ Synthetic Speech
Simple Text    ↔ Monotone Speech

Conclusion:
Discussion: What phonetic/phonemic alphabets should be used for SSML? Is IPA satisfactory for representing the pronunciation of words?
Ksenia: Include the schwa in the SSML text, instead of
  complex processing

Richard: In English you need to use the lexicon
  for everything

Ksenia: Not possible to enumerate for highly inflected
  languages

Richard: Hindi is present

Ksenia: The issue is whether to eliminate it or not.

Przem: Also in Polish, it is not a solution even if
  it can help.

Joshi: All the Indian languages are based on Sanskrit;
  phonemes are added together

Nixon: In the languages we have done, these cases are solved
  by an additional schwa

Dan: A similar issue occurs in Hebrew. 
  SSML is to mark up text a human can read. If the processor
  has a problem, an alternative pronunciation is given.

Richard: If you look at Hebrew and Arabic there is more and more
  of this; in Hindi much less. For SSML you let the processor
  do more. If you can do that, you should.

Nixon: What we did was exactly that: create a dictionary
  and then ...

Paolo: In defence of IPA:
  - it is one way of writing pronunciations, with many
    drawbacks: difficult to type, difficult to read, but that is
    only part of the problem ...

Dan: At first, I strongly disagree with Paolo.
  Many people do not like IPA; in China especially, [pinyin]
  is taught in school to children and is easy to type.

Richard: Totally agree with Dan and Paolo.
  But even for Chinese, if you need allophones you will
  use IPA. 
  Back to schwa: are there morphological rules?

Ksenia: If there is a morphological boundary, one rule applies;
  if not, another rule.

Richard: If you add a "virama" sign, you will give the
  missing information.

Dan: If there is a way in the script to adjust the text,
  it could be done manually or by pre-processing.
  There could be accessibility concerns with doing this.

(Discussion on Bhrigus in India)

Ksenia: Why do you have <sub>?

Dan: it is a facility.

Chris: IPA is useful as common ground. You can create
  a resource. Other alphabets can be prescribed, but IPA
  should keep a presence.

Richard: This is why you have markup. You can add
  metadata to describe the difference. [...]

Dan: I was thinking of creating a registry for alphabets,
  with a process for registering an alphabet.
  This will be discussed in the group.

Kazuyuki: We will continue this discussion in the further
  topics tomorrow.

 

(DINNER)

 

Wednesday 31 May, 8:30-18:00

Session 5: Multilanguage Issues

Moderator:
Richard Ishida
Scribe:
Max Froumentin
Kimmo Parssinen: Development Challenges of Multilingual Text-to-Speech systems –[Slides]

Topics include: fallback language, language mappings, preprocessor, multilingual

Zdroik Przemysław: Position paper for 2nd W3C Workshop on Internationalizing the Speech Synthesis Markup Language (SSML) –[Slides]

Topics include: token element, missing diacritics, word stress

Discussion: How to represent foreign words?

Ksenia: African languages?

Kimmo: we probably support Zulu and Afrikaans, but they won't be
      available before the UI supports them, and currently S60 phones
      don't have it.  But a policy decision is that all UI
      languages are supported

Nixon: footprint of the ASR?

Kimmo: about 200 KB; the TTS engine is 20-30 KB + language data

Richard: spoken vs written languages?

Kimmo: the processor wouldn't change the voice, but would do its best

Max: basically what Davide suggested

Paolo: element or attribute?

Kimmo: as long as it's understandable, and the requirements are fulfilled

Géza: could use say-as with lexicons for that. There could be 2 layers
      of lexicons, one provided by the TTS engine provider, one by the
      application developer. Helpful to be in the standard in some way.
      However, I'm wary of using the same lexicon for ASR and TTS.

Paolo: we try to accommodate both ASR and TTS needs in the PLS lexicon. 
       The user lexicon would be used for very few adjustments. In many
       simple cases, you want adjustments in both ASR and TTS.

Géza: in ASR, how would you relate the lexicon to the standard grammar?

Paolo: both grammars and SSML have the lexicon element, so you can
       refer to one lexicon from both. Only the standard format is
       missing, and with PLS we're trying to address that simply

Kimmo: yes, the key here is the standard.

Paolo: about the list of preferred languages: xml:lang doesn't take
       more than one, so we would need a new attribute.

Richard: for later discussion...


Paolo: can we have more than one value in xml:lang? If not, we
       need another attribute. 


Richard: you could, in principle have multiple values in xml:lang, but
         I strongly discourage that, in order to align with other specs.
         xml:lang may serve as a default

Chris: xml:lang describes the content, but it should not be overloaded
      with additional meaning, giving instructions to the TTS. It should
      be another attribute with extra semantics.

Dan: yes, there is a need to distinguish between written and spoken
     attributes.

Max: xml:lang would then not be used

Paolo: no, we need both as hints to the TTS. Today it isn't clear.

Géza: suggest <voice lang="...">...</voice> 

Oumayma: you can create a synthesizer for all languages, and SSML should support that.

Przemysław: it's very hard

Géza: don't forget the problem with "unknown"

Paolo: just checked, and found that xml:lang is required on <speak>. I suppose
       that "unknown" would be handled with no xml:lang

GézaK: doesn't work for subparts that we want to mark as unknown.

Richard: xml:lang could be empty (""); it may mean "unknown" or "this is not a language", not sure.
         ISO 639 has an "unknown" tag, I should check. In principle you can have unknown anyway.

Dan: whichever way; if there is an existing mechanism, then we can use it

GézaK: worth mentioning in the spec, though.
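
For the record, ISO 639-2 does define the tag "und" for undetermined language, so in principle a document could declare (whether processors accept this is another question):

  <speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
         xml:lang="und"> ... </speak>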

Kazuyuki: language specification can affect both SSML and PLS. Would it be
          interesting to specify the language separately in PLS, i.e. to select a PLS document?

GézaK: each PLS has one language, so there could be a selection
       according to the language specified in the SSML instance.

Richard: but lexicons are specified at the top of the SSML document.
         e.g. <speak xml:lang="en"> The French word for cat is
         <voice xml:lang="fr">chat</voice> </speak>

Paolo: you can have 2 lexicons after <speak>: english.pls and
       french.pls. The problem is that the engine has to load both to know
       the lexicon languages.  [Dan reads from the SSML spec about
       lexicons]
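
In SSML 1.0 terms, the setup Paolo describes looks roughly like this (file names illustrative):

  <speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
         xml:lang="en">
    <lexicon uri="english.pls"/>
    <lexicon uri="french.pls"/>
    The French word for cat is <voice xml:lang="fr">chat</voice>.
  </speak>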

GézaK: if there isn't a French "chat" in the lexicon, then you should
       use the English "chat"

Dan: if you had en-GB and en-US in the lexicon, which one do I use if
     I get lang="en"? So the matching is not simple. We could specify
     the language, but the lower parts (region/dialect) are up to the
     processor.

Oumayma: in written text, foreign words can be in italics or
         surrounded by quotation marks.  That provides hints that
         the output may sound strange.

Dan: but not necessarily, e.g. internet, webcast, iPod, etc. in German.

Richard: coming back to question

Nixon: we don't have to be so specific: just a tag marking it as "alien",
       and let the processor handle it. Going back to "unknown".

Przemysław: you may also find "chat" in 2 lexicons.

Max: if you have, in English, "tonight we're showing <<la vita e bella>> and <<les quatre-cent coups>>",
     do you use different normalisation rules for each foreign-language part?

Przemysław: yes.

Chris: arguing for the synthesizer to find the best way to pronounce a
       word. Easy in Greek, because you can easily detect foreign words
       from the script.

Dan: just like "foreign"

Richard: doesn't apply to all languages.

Dan: synthesizing: people want to avoid changing voice when they
     change language.  You want a piece to be spoken in some
     different way. In what way do you want it to be spoken in a
     different voice (which could be the same speaker...)?

Géza: 3 ways: 1. phonemes, 2. prosody, 3. accent. So far we haven't said anything
      about 2, and we haven't separated 1 and 2.

Dan: sometimes you don't even want to pronounce film titles; in widely 
     different languages you'd just translate.

Jerneja: movie titles aren't the best examples: people's names are most important.
         You can't skip or translate them.

 

(BREAK)

Session 6: Disambiguation of Multiple Pronunciation

Moderator:
Jim Larson → Paolo Baggia
Scribe:
Richard Ishida
Oumayma Aldakkak: Computational Methods to Vocalize Arabic Texts –[Slides1, Slides2, Slides3]

Topics include: vowel length identification, POS, type of emotion (mainly mentioned in attachments: SSML_A.dpf.pdf, final_Emotion.pdf)


Paolo: What does "incorporation of the vocalization module" mean?
The module takes written language and adds vowels.
SSML is unable to work on text fed from a news feed or the like.
 

Nixon: How do you handle syllables?

OD: Grapheme to phoneme: provide the vowels,
then convert to semi-syllables
 

Nixon: What are the units you use?

OD: diphones
 

Demo of the vocalizer shows that there are multiple possibilities for a number
of words/phrases.
The choice among alternatives is made by an unsupervised learning algorithm.
Jerneja Zganec Gros: Considerations on using PLS for Slovenian Pronunciation Lexicon Construction –[Slides]

Topics include: pronunciation style, emotion, dialects, pron-source, POS


Paolo: Is Multext-East LRs a standard that is internationally known?  We
would prefer to use existing standards rather than create our own.

JZG: not an ISO standard, but developed with a fairly wide group of people

Chris: Who is the intended user for PLS?

Paolo: the application developer - JZG has just mentioned use for internal
development - this is a possible extension, but it would be a great deal of
work
Discussion: How should multiple pronunciations be identified and disambiguated in SSML? Is part of speech useful enough? Should emotions be included in SSML?

Paolo: what can SSML (the interface to the TTS engine) do?  Do we need
special markup?
 
Paolo explains that SSML can be associated with other preprocessors, or
the work can be done in the TTS.  SSML could be used to pass messages to the
TTS.  
 

Dan: You can't do anything inside the SSML document - it's a document.  But
you can say what changes need to be made by a processor.
 

OAD: Vocalization should be incorporated in the TTS.  POS markup in SSML
would help the vocalization.
 

GezaN: Can you add POS information to 'words'?
 

RI: No, Arabic words carry multiple morphs and you need to mark these up to
solve the problem
 

Kimmo: Why would you mark up the text rather than just add the vowels?
 

Paolo: The idea is that you can add part of speech to help the vocalisation.
 

OAD: Sometimes text in say a news feed is partially vocalised where
ambiguity needs to be avoided.
 

Paolo: So SSML could include information on whether text is unvocalised,
partially vocalised, or fully vocalised.
 

Kazuyuki: I think vocalization is not special processing but text analysis
included in TTS. Is the input of vocalization plain text?
 

OAD: Yes
 

Kazuyuki: So vocalization is a module of TTS.
 

Paolo: But you could also preprocess the SSML text.
 

Dan: The main reason for markup in SSML is to direct the processing that
follows.  So either you do the vocalization on your own, or you need some
assistance from the author to help with the vocalization.  The only thing
that seems to have been mentioned so far is morphological markup.  We also
heard that there may be several morphs in a single word; that doesn't exist
at the moment and I cannot figure out how to make it happen without
extensive work.
 

GezaN: That also applies to Hungarian, Slovene and other agglutinative
languages.
 
Discussion about whether Arabic and other agglutinative languages are the
same wrt POS.
 

OAD: I don't think we need to put markup in SSML for morph.
 

JZG: But surely it can help?
 

Przemek: But why use markup if you could just add the vowels to the text?
 
Dan explains the distinction between automatically processing text such as
news feeds versus handcrafting text that will often be reused.
 

Geza: We should conserve the original text and use markup to annotate it.
It would be useful to provide markup for POS because it is also useful for
prosody and other things.
 

Chris: SSML is not an annotation tool. It won't be used for morphological
annotation. If we add more and more information, where do we stop? Why not
just provide phonetic transcriptions?
 

Dan: SSML is a language of hints only - a processor can actually ignore
almost everything specified - almost every TTS vendor has a smart TTS
processor and may not agree with what other people suggest.  Since it is a
hint language, we need to consider what kind of hint is useful, on the
assumption that the TTS is pretty good most of the time.  POS markup (incl.
gender, case, etc.) was not implemented because of the difficulty of working
out what the labels should be.  Perhaps we can provide a mechanism for
people to annotate the way they want.  It's useful for people who don't know
phonetic alphabets but do know the POS info, and another group that.....
 
One of the best examples is numbers - they need to match in gender, case, etc.
The TTS needs to respond and know whether you are talking about pencils or
tables, and you could tag the number to say what the gender etc. is.  Maybe
in the future we will be able to standardise POS, but it's probably too
early right now.
 

Nixon: Is it important to keep the original text when vocalising?
 
Agreed to talk about that later.
 

GezaN and Dan: Summary: There is some interest in trying to standardise at
least a subset of the possible values.  There is broad interest in enabling
labelling.
 

JZG: There may be some ISO work going on in this area.
 

Paolo: We'd be very interested to find out about that, because we can't
handle this ourselves.
 

Kazuyuki: JEITA people suggested that POS is useful in Japanese TTS, but it
is not used as input info for TTS; ruby is used instead.
 

Dan: It's still not clear to us whether ruby needs to be in SSML.

 

(LUNCH)

Session 7: Say-as Topics

Moderator:
Dan Burnett
Scribe:
Paolo Baggia
Chris Vosnidis: Position paper for the Second W3C Workshop on Internationalizing the Speech Synthesis Markup Language (SSML) –[Slides]

Topics include: inflection, format, details, alias maps

Discussion: How should the <say-as> tag be extended?
Motivations:
- Greek is a heavily inflected language:
  nouns, adjectives, verbs
- Several inflection-related issues
- How does inflection work?
  - Inflection attributes are shared between certain
    elements in the same context
  - Elements might not be neighboring

Inflection definition in Say-as
- provide hints to the synthesis processor
  - which inflected version to use?
  for: case, number, gender
  Ex. number 3

We need context sensitive substitutions
- aliasmap element with inflections

Additional <say-as>
- there is no template for describing the way a token
  should be rendered
- a new major version should address this
- two small changes to the existing say-as:
  - use details for date
  - use format for phone

Say-as Telephone: format

<say-as interpret-as="telephone" format="3223"
  details="30">2156664444</say-as>

Say-as Date: details 

(Clarifications on the examples)
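
A sketch of what such an inflection hint might look like (the case and gender attributes are purely illustrative; nothing like them exists in SSML 1.0 or the say-as Note):

  <!-- hint: render "3" in the feminine accusative form -->
  <say-as interpret-as="cardinal" case="accusative" gender="feminine">3</say-as>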


Paolo: the aliasmap seems to be related to the lexicon


Chris: yes, but you would need to change the standard


Max: the "k" is difficult for the lexicon, because you can expand it
  as "kilos" and also as "kilometers"


Paolo: All of them can stay in the lexicon; you would use
  the role to reference them, but in this example there is
  also the inflection. This is a problem for PLS


Dan: Is it possible to grasp the inflection from the sentence?


Chris: Not simple. 


Dan: The way it was designed, it is more of an "interpret-as". It is
  not about the rendering. There is not enough information.


Richard: if we're dealing with a news feed, this kind of markup
  will be used several times.
  Can it be done case by case?


Chris: Orthography for numbers is complex, and they can be
  generated from a database. I don't want to do something
  as complex as pre-processing. 


Richard: The issue is whether say-as is useful for doing this.


Przem: We don't know the number of e-mails.


Paolo: I agree this is difficult to do in the lexicon.


Dan: Why not use the lexicon?


Paolo: Because there is a number, "3 k.", and you cannot
  put all the numbers in the lexicon


Dan: Ok.


Max: If you go in that direction, the example of "read" 
  for Russian will also expand into a very large number of entries


Paolo: You are right, but a general solution for highly
  inflected and generative languages is for a future
  version. We would like to discuss it.


Richard/Dan: Relation of context and numbers in Japanese
  and Russian. You cannot pronounce the number correctly
  without the context.


Oumayma: Explanation of the variation of gender and case for
  numbers.


Dan: Say-as is a tricky element for many reasons.
  SSML does not really need say-as, because it is a way
  to convey a semantic category. Everybody likes "date",
  but "cardinal" seemed to be minor for many languages.
  Say-as requires a separate effort, because it is
  too big.


Chris returns to the point of changing NMTOKEN to CDATA
  to include values like 
  details="blah?param1=xxx&"


Dan: We need to answer what we want to accomplish.

Discussion on telephone number reading:
   New use case: when you say a phone number, you may not
   want to use the normal grouping for that country,
   for understandability.
   Example: for a Greek person, a US number is read in
   a certain way

(discussion on Dan's questions)


Ksenia: Introduce spelling?


Chris: it is already in the W3C Note, called "characters"


Paolo: The real issue is whether we want to restart this 
  activity and whether there are enough people interested
  in it. The current situation is not clean for the
  standard.


Dan: I agree with Paolo. 


Richard: The question I have is how I can do spelling
  in Arabic and Indian languages. 


Oumayma: We have 28 letters, and we spell them. We say
  the name of the letter "a", "b" (not /bi/). We do
  not pronounce short vowels in spelling. 
  The spelling describes how the word looks on
  paper.


Richard: What about Indian languages?


Prof. Joshi: Example: spelling "W3C".
  No plural, but "doubleyou three ci".
  Description of phonemes.


Richard: Ex. spelling "kamra" (which means "room")


Joshi/Nixon: In Hindi the syllables are pronounced,
  but if people will not understand, the single
  phonemes will be pronounced. 

 

(BREAK)

Session 8: Remaining Topics

Moderator:
Dan Burnett
Scribe:
Max Froumentin
Discussion: remaining topics and new topics arising during earlier discussions
Remaining topics, with count of interest


* Specialized content (SMS, email): 11

GezaN: the problem is that it would be very useful to be able to find out
  which characters the network supports. The synthesizer may generate
  output differently depending on the character set. Also, SMS should be
  part of say-as. Also a link to the pronunciation lexicon.

Przemysław: SMS centers cut diacritics.

Max: if xml:lang="pl" and encoding="us-ascii" and style="sms", then infer
  the diacritics.

GezaN shows an example: diacritics are removed, but sometimes only some of them.
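
A sketch of Max's inference example (the style attribute is hypothetical):

  <!-- Polish SMS text stripped of diacritics; the processor is asked to restore them -->
  <s xml:lang="pl" style="sms">Prosze zadzwon pozniej</s>
  <!-- intended reading: "Proszę zadzwoń później" (please call later) -->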
* Speaking style: 8

Ksenia: it's science fiction at this stage. Too many parameters are needed.

GezaK: we already have speaking-style="casual".

Paolo: the values are difficult to define. An open list?

Max: is it generating the style voice, or the values?

Paolo: you can have a specific database for a style. It's up to the
  processor whether it has a news style. Maybe change voice.

Ksenia: I don't think that's up for standardisation.

Paolo: it is a way of addressing the problem; it is possible to customise
  your TTS. SSML doesn't let you specify [...]

Ksenia: never heard one with a good speaking style.

GezaN: they exist.

Oumayma: about style synthesis, you add that later.

Max: language tags? RFC 3066?

Richard: a style is not appropriate as a language tag.

Raghunath: speaking style is important. There's a parallel to handwriting style.
* Stress marking: 6

Przemysław: the best-quality TTS systems have special signs for stress
  indicators. There's no way to do that right now except IPA.

Dan: that's something you could use a lexicon for, and a phoneme. Would a
  different alphabet be a solution? In Chinese, pinyin allows specifying tone.

Przemysław: one character would be enough. Maybe 2.

Paolo: it is not SSML's job to standardise stress markup.

Dan: perhaps because many TTS engines handle stress differently, so they
  mark it differently. It's not clear how you would want to standardize.
  IPA has first and second stress.

Przemysław: but not practical.

GezaK: can use <sub>.

Dan: wondering whether the issue is standardising stress marking, or
  designing a language.

Nixon: agree that such features should be left to the application. It opens
  a can of worms how many ways there are to do it.

Chris: there's IPA, and there are alternative alphabets instead of IPA.

Oumayma: it's all a question of prosody: we can do everything with prosody.
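
What can be done today, per Dan's remark: IPA already carries primary (ˈ) and secondary (ˌ) stress marks through the existing phoneme element, e.g.:

  <phoneme alphabet="ipa" ph="ˌfoʊtəˈɡræfɪk">photographic</phoneme>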
* Stronger linkage to PLS: 5

Jerneja: there are several interfaces where PLS should link to SSML, e.g.
  PoS, maybe some speaking style (pronouncing a word casually or not).
* Emotions: 4

Oumayma: for emotions we play on 3 prosodic features (amplitude, contour, ??).
  In SSML, you could put a mark on a phrase or a word to have it pronounced
  happily or angrily, which will result in modifying its prosody.

Nixon: Ellen Eide has a good paper on that.

Dan+Paolo: she's involved and interested!

Dan asks who has built a TTS engine and who is comfortable including
  emotions. Result: a little more than 50%.
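
Emotion markup does not exist in SSML 1.0; the closest an author can get today is a rough prosodic approximation along the lines Oumayma describes:

  <prosody rate="fast" pitch="high" volume="loud">Great news!</prosody>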
* Making changes via orthography or markup: 3 (<sub> or other markup, or changing the original script)
* Metadata: 2
* African tones: 1

Session 9: Summary and conclusion

Moderator:
Jim Larson → Dan Burnett & Max Froumentin
Scribe:
Dan Burnett → Kazuyuki Ashimura
Discussion: What are the next steps?
Invite experts to the SSML subgroup of the Voice Browser Working Group
to update requirements and develop specifications.

The VBWG needs experts on the issues clarified in this workshop, because the
group only works on topics of group interest.

Q: What contribution do you need?

Max: If you're a non-member, you can get the public drafts, follow the
public mailing list, and your comments are welcome; every comment will be
addressed. If you're a member and a VBWG participant, you can take part in
f2f meetings and telephone conferences, and can also subscribe to the
internal mailing list. In addition, you can participate in all the other
WGs' activities, like the Semantic Web.

Paolo: there are 2 steps to participating in the SSML activity: (1) become
a W3C member and (2) become a VBWG participant.

Max: When you become a VBWG participant, Patent Policy agreement is
required.

Nemeth: I'll check whether my organization is a member or not.

Paolo: research institutes and universities should get a discount.

Max: please visit the W3C web site: http://www.w3.org/Consortium/join for
the procedure. And please ask Max and Kaz about the details ;-)

Conduct workshop(s) for other languages.

Dan: In Beijing, there were contributions from China, Japan, Korea and
Poland. This time, we have many other languages: Greek, Indian, Russian,
Arabic, Hungarian, ... What/where to do next?

Ksenia: How about South Africa?

Max: Please join us and propose it ;-)

Ksenia: Especially localized French or English in South Africa.

Nixon: I'd like to suggest India.

Richard: Another workshop (not held by the VBWG) will be held at CDAC in
India (= the W3C India Office) in August.

Jerneja: There will be an HLT conference.

Dan: Support for a major language family like Slavic is required.

Chris: Why not Turkey?

Dan: The Middle East is important.

Paolo: 2 Turkish groups couldn't participate in this workshop because of
schedule arrangement and deadline issues. We should have given participants
more time...

Nixon: Are there any local activities? China and India have big markets. If
there is a local chapter, more people are available. The WG might be big
enough...

Max: In fact, this workshop was originally planned for Turkey, but changed
to Greece because of bird flu.

Ksenia: How about Speecon in Russia?

Oumayma: Is individual membership available?

Max: People who should participate because they have knowledge and skill
can participate in the WG as invited experts.

Dan: Thank you for your thoughtful suggestions and comments. If nothing
else, adjourned.

[Workshop adjourned]


The Call for Participation, the Agenda and the Logistics Information are available.


Jim Larson and Kazuyuki Ashimura, Workshop Co-chairs
Max Froumentin, Voice Activity Lead

$Id: minutes.html,v 1.21 2009/01/05 15:44:09 ashimura Exp $