Jim Larson – Chair, VBWG

Section 2 – Diacritics for Auto-completion

Scribe: Helen Meng

Przemyslw Zdroik and Krzysztof Majewski represented Polish telecom and gave a talk entitled “Telekommunikacja Polska: An extension to the <say-as> element for diacritics auto-completion”.

The speakers came from the R&D center of Polish telecom, where they work in the vocal services section.

They outlined four subjects related to the topic:

Nature of the problem

Similarities among other languages

Possible solutions

Discussion

Definition of Diacritics

A diacritical mark, or diacritic (sometimes called an accent mark), is a mark added to a letter to alter a word’s pronunciation or to distinguish between similar words.
The Polish alphabet contains 35 letters = 26 basic + 9 with diacritics
Letters with diacritics are pronounced differently from the corresponding letters without diacritics
Included in ISO-8859-2, Unicode, CP-1250, DOS 852…
Not included in the US ASCII 7-bit codepage
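
The code page points above can be checked with a minimal Python sketch. The codec names below are Python's own aliases and are assumed to correspond to the code pages named in the talk; the list of nine letters is supplied here for illustration and was not shown in the minutes.

```python
# The nine lowercase Polish letters that carry diacritics (illustrative list).
POLISH_DIACRITIC_LETTERS = "ąćęłńóśźż"

# Python codec aliases assumed to match the code pages named in the talk.
CODECS = ["ascii", "iso8859_2", "cp1250", "cp852", "utf-8"]


def can_encode(text: str, codec: str) -> bool:
    """Return True if every character of `text` exists in `codec`."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


# Map each codec to whether it can represent all nine letters.
support = {codec: can_encode(POLISH_DIACRITIC_LETTERS, codec) for codec in CODECS}

for codec, ok in support.items():
    print(f"{codec}: {'supported' if ok else 'not supported'}")
```

Only the 7-bit ASCII codec rejects the letters, which is consistent with the reasons given for diacritics being dropped in ASCII-only channels.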

The reasons why Polish diacritics sometimes disappear

Not possible or difficult to type

Create difficulties for codepages

Pruned on WWW-SMS gateways

The emergence of quasi-Polish text (without diacritics)

Is not orthographically correct

Not up to netiquette

Is not Polish

Cannot be transformed into Polish with simple substitution rules

Speech synthesized from this text may be incomprehensible
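
The claim that quasi-Polish cannot be converted back with simple substitution rules can be illustrated with a small Python sketch. It assumes that quasi-Polish is produced by replacing each letter with its base letter; the function name and the word pairs are illustrative and do not come from the talk.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Produce the quasi-Polish (ASCII-only) form of a Polish string."""
    # "ł" and "Ł" have no canonical decomposition, so map them by hand.
    text = text.replace("ł", "l").replace("Ł", "L")
    # NFD splits e.g. "ą" into "a" plus a combining ogonek, which is dropped.
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# Pairs of distinct Polish words that differ only by a diacritic:
# sad (orchard) / sąd (court), kat (executioner) / kąt (angle),
# laska (cane) / łaska (grace).
PAIRS = [("sad", "sąd"), ("kat", "kąt"), ("laska", "łaska")]

# Every pair collapses to one spelling once diacritics are removed.
collisions = [(a, b) for a, b in PAIRS if strip_diacritics(a) == strip_diacritics(b)]
print(collisions)
```

Because the stripping step is many-to-one, no per-letter rule can decide whether "sad" should stay as it is or become "sąd"; that decision needs the word and its context.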

The facts about quasi-Polish:

Sometimes it is the only possibility to represent text
It is easier to write = can be written faster
It can be read quite easily by humans as if it were written correctly (because of the nature of human cognition)
Similar situations arise in other languages, e.g. Czech and Slovak (Slavic languages very similar to Polish), German, Russian, French, and many others. There is also an informal Romanization used in SMS messages.

Discussion → How should we classify the problem?

Is quasi-Polish a new dialect?
Context-dependent orthography
An erroneous text that requires correction (jargon)?

Consider the example of quasi-Polish as a language of communication in instant messaging

Should the textual correction (diacritic completion) be incorporated in the instant messaging core?
Should the diacritic completion be incorporated in the external lexicons?
Should the diacritic completion be incorporated in the text normalization?
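
Whichever layer hosts it, one way to picture diacritic completion is as a lexicon lookup. The following Python sketch assumes a toy lexicon and hypothetical helper names (`build_index`, `complete`); it is not the mechanism proposed in the talk.

```python
import unicodedata
from collections import defaultdict


def strip_diacritics(text: str) -> str:
    """Map a Polish string to its quasi-Polish (ASCII-only) form."""
    text = text.replace("ł", "l").replace("Ł", "L")  # no decomposition for ł
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# Toy lexicon (hypothetical); a real system would load a full word list.
LEXICON = ["sąd", "sad", "żółw", "łaska", "laska", "dziękuję"]


def build_index(lexicon):
    """Group lexicon words by their diacritic-free spelling."""
    index = defaultdict(set)
    for word in lexicon:
        index[strip_diacritics(word)].add(word)
    return index


def complete(token, index):
    """Return (restored_word, candidates) for one quasi-Polish token."""
    candidates = sorted(index.get(token, {token}))
    if len(candidates) == 1:
        return candidates[0], candidates  # unambiguous: restore directly
    return token, candidates  # ambiguous: leave unchanged for later stages


INDEX = build_index(LEXICON)
results = {t: complete(t, INDEX) for t in ["zolw", "dziekuje", "sad", "xyz"]}
for token, (word, candidates) in results.items():
    print(token, "->", word, candidates)
```

Tokens with a single candidate are restored directly. Ambiguous tokens such as "sad" are returned with their candidate list, so a later stage with access to context would have to choose between them.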

Question & Answer Session:

Should jargon tags imply ASCII tags?
Are there cases where it is hard for humans to disambiguate among possible ambiguities in text without diacritics?
For example, young people often invent short names for many words in order to save time when typing short messages.
The instant messaging (IM) problem occurs everywhere, with invented words and phrases. Can those be handled by a translation process using the lexicon?
The problem is also similar to “read” (present tense) versus “read” (past tense), except that the problem is more extensive in languages with diacritics, i.e. there is newly created ambiguity due to the missing diacritics.
How should Polish be written in terms of an alphabet? For example, RFC 3066 is used to represent language: zh-latin, zh-CN (for simplified Chinese), zh-SG, zh-Hans (using a 4-letter code for Chinese), etc. Can we have an analogous alphabet for Polish?
As an alternative, can we represent the phenomenon using language codes and script codes, or just as new words in the vocabulary, such as TTYL (representing “talk to you later”)? Or do we need a different orthographic form?
We need to distinguish between two problems: (i) the need for a new orthography, and (ii) the need for a conversion process that requires a lexicon. In “broken” text, we lose information and we may need to use additional resources, such as external lexicons, to perform recovery of information and related disambiguation (e.g. in semantic processing).
We can incorporate more embedded processing in text-to-speech synthesis, but PLS may allow us to attack the problem at a lower level.
To what extent is this ambiguity problem related to problems specific to SMS?
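
The question raised above about language codes and script codes can be made concrete with a small Python sketch that splits a tag into subtags. The classification by subtag shape (4 letters for a script, 2 letters or 3 digits for a region) is an assumption based on common language-tag conventions rather than on anything stated in the session, and "pl-Latn" is a hypothetical example.

```python
def classify_subtags(tag: str) -> dict:
    """Split a language tag on '-' and label each subtag by its shape."""
    parts = tag.split("-")
    result = {"language": parts[0].lower()}
    for sub in parts[1:]:
        if len(sub) == 4 and sub.isalpha():
            result["script"] = sub.title()  # 4-letter script code, e.g. Hans
        elif (len(sub) == 2 and sub.isalpha()) or (len(sub) == 3 and sub.isdigit()):
            result["region"] = sub.upper()  # region code, e.g. CN
        else:
            # Anything else (e.g. "latin") is kept but left unclassified.
            result.setdefault("other", []).append(sub)
    return result


for tag in ["zh-CN", "zh-Hans", "zh-latin", "pl-Latn"]:
    print(tag, classify_subtags(tag))
```

A script subtag alone would not separate quasi-Polish from standard Polish, since both are written in Latin script; marking text as diacritic-free would need some further convention.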