W3C Internationalization Workshop
Position Statement: Appendix A: Language Tagging

Richard Ishida

Globalisation Consultant
International Document & User Interface Design
Xerox GKLS
http://www.xerox-emea.com/globaldesign/


What follows are some ideas from a mail note sent to langtag@unicode.org in October 2001. The intent is to stimulate thought about needs for 'language tags' that have so far been neglected. My position statement for the W3C Internationalization Workshop refers to this appendix.


This note builds on some thoughts I had earlier this year that were resurrected by the Unicode language tag panel in September.

Although we certainly need to address the need for additional codes, this is only one of several issues that exist for labelling text data, certainly in the localization industry and probably elsewhere. This note attempts to outline some of the other aspects that I think we should be considering, and proposes (in a crude form) one possible approach that might help. This proposal is provided only to help the debate - I am not unalterably attached to it.

This note first introduces some of the requirements, then introduces a proposed format to address those requirements, then gives examples of the format for each of the requirements previously outlined.

Key features of the proposed approach:

Requirements

The following questions list some of the requirements I can see for labelling information using something along the lines of language tags, both in the localisation world and in some other potential applications I can imagine. Some of these requirements are already met by RFC 3066, but most are not met at the moment.

Script

How do we refer to a script variant?

eg. Mongolian in Mongolian script vs. Cyrillic

eg. Traditional vs. Simplified Chinese

How do we label an item of content by script without other information?

eg. Latin script, language unspecified

How do you indicate that this is a transcription of a language usually written in another script?

eg. a transcription of Urdu in Latin script

Language & dialect

How do we extend the language codes available?

How do we refer to language tags from other vocabulary systems?

eg. by private arrangement, a system may use SIL's Ethnologue 3-letter tags throughout, or use them to supplement existing codes

How do we distinguish particular dialects?

How do we refer to a spoken language that has no (formal) written form?

eg. Swiss German

How do we refer to a language in a non-specific way?

eg. just French, not France or Canadian French; Catalan, whether it is in France or Spain

How do we identify a language group, rather than a specific language?

eg. Romance languages

How do we identify a dialect without being specific about region?

eg. Romany

How do we know whether a language tag refers to a written or spoken form of the language?

eg. is de-CH by default the German written in Switzerland (or spoken on the Swiss TV news), or the schwyzertuutsch spoken in everyday conversation?

Locales

How do we refer to a language that is generalised to fit a geographical area that comprises numerous more detailed locales?

eg. Latin American Spanish

How do we specify a locale that fits a particular group of countries?

eg. French-speaking communities in Belgium and France, but not Canada.

How do we specify a geographical area without mentioning language?

eg. voltage settings for Canada

History

How do we identify a language variant along historical lines?

eg. Chaucerian English

Other

How do I label ambiguous text?

eg. Jean put dire comment on tape.

How do I remain ambiguous about multiple languages?

eg. a document equally in French, Italian and German

How do I specify multiple languages/scripts/locales?

eg. a document equally in French, Italian and German

How do I mark an unidentified piece of content?

General

How do we maximise interoperability and standardisation by incorporating current language tag usage into the extensions that will be developed to meet more sophisticated needs?

Is there a need for a more rigorous standardisation of locales?

To what level does one need to specify a locale?

eg. is it OK to use fr for French French and fr-CA for Canadian French, or should one use fr-FR and fr-CA?

How do we avoid different but equally valid labelling?

eg. a dialect that has a 3-letter primary language tag but can also be denoted by a 2-letter tag plus subtag

Tentative solution proposal

This proposal attempts to build on RFC 3066 and provide maximum opportunity for conformance with current implementations, while providing a mechanism for extending the sophistication of language/locale tagging. I haven't elaborated the logic to the nth degree yet; my hope is simply that it will provoke discussion of how to answer some of the above questions, and ideas about how to proceed.

[A] Tag order could be more constrained than currently, something like the following:

primaryLangTag ('-' geographicalRegionTag ('-' dialectTag)*)? ('/' scriptTag? ('/' historicalEraTag)? )?

Examples:
en
en-GB
en-GB-scouse
en-GB-scouse-xxx
en-GB-scouse/LAT
en-GB-scouse/LAT/1950CE-1963CE
en-GB-scouse//1950CE-1963CE
en-GB//Chaucerian
ur/LAT
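
To make the tag order in [A] concrete, here is a minimal Python sketch of how such a tag might be split into its parts. It is an illustration only, not part of the proposal: the names ProposedTag and parse_tag are hypothetical, and the character set allowed in each subtag (ASCII letters and digits, with hyphens allowed inside the historical era) is an assumption, since the note does not define one. The '/' separators are split first because the era examples themselves contain hyphens.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional

# Assumption: subtags are runs of ASCII letters and digits.
_SUBTAG = re.compile(r"[A-Za-z0-9]+")
# Assumption: an era may contain internal hyphens, as in "1950CE-1963CE".
_ERA = re.compile(r"[A-Za-z0-9]+(-[A-Za-z0-9]+)*")


@dataclass
class ProposedTag:
    language: str
    region: Optional[str] = None
    dialects: List[str] = field(default_factory=list)
    script: Optional[str] = None
    era: Optional[str] = None


def parse_tag(tag: str) -> ProposedTag:
    """Split a tag following the order sketched in [A]."""
    parts = tag.split("/")
    if len(parts) > 3:
        raise ValueError(f"too many '/' separators in {tag!r}")

    # Part 1: primary language, optional region, then any dialects.
    subtags = parts[0].split("-")
    for subtag in subtags:
        if not _SUBTAG.fullmatch(subtag):
            raise ValueError(f"bad subtag {subtag!r} in {tag!r}")

    # Part 2: the script slot may be present but empty, as in "en//Chaucerian".
    script = parts[1] if len(parts) > 1 else ""
    if script and not _SUBTAG.fullmatch(script):
        raise ValueError(f"bad script tag {script!r} in {tag!r}")

    # Part 3: if a second '/' is present, an era must follow it.
    era = parts[2] if len(parts) > 2 else None
    if era is not None and not _ERA.fullmatch(era):
        raise ValueError(f"bad historical era {era!r} in {tag!r}")

    return ProposedTag(
        language=subtags[0],
        region=subtags[1] if len(subtags) > 1 else None,
        dialects=subtags[2:],
        script=script or None,
        era=era,
    )


if __name__ == "__main__":
    for example in ("en-GB-scouse/LAT/1950CE-1963CE", "en-GB//Chaucerian", "ur/LAT"):
        print(example, "->", parse_tag(example))
```

Under these assumptions the sketch accepts only the order given in [A]. Forms that go beyond that order, such as a vocabulary prefix or a comma-separated list of regions, are rejected and would need the syntax to be extended.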

[B] Primary LanguageTag

[C] Geographical RegionTag

[D] Dialect Code(s)

[E] Script Code

[F] Historical Era Code

Example solutions to above questions

Here I take the same questions we saw above and try to propose how they would be answered by the tentative proposal just made. (I will make up many of the codes - the important point is the syntax).

Scripts

How do we refer to a script variant?

eg. Mongolian in Mongolian script = mn/MNG
eg. vs. Mongolian in Cyrillic = mn/CYR
eg. Traditional Chinese = zh/TCH
eg. vs. Simplified Chinese = zh/SCH

How do we label an item of content by script without other information?

eg. Latin script = z/LAT

How do you indicate that this is a transcription of a language usually written in another script?

eg. a transcription of Urdu in Latin script = ur/LAT
[Note: we may need to be able to specify the method of transcription, eg. ja/LAT-Hepburn]

Language & dialect

How do we refer to language tags from other vocabulary systems, eg. SIL?

eg. French using the SIL vocabulary = sil:frn
eg. could still be related to geography, so Channel Islanders may speak = sil:frn-GB

How do we distinguish particular dialects?

eg. as before, but with a geographic region: scouse = en-GB-scouse

How do we refer to a spoken language that has no (formal) written form?

eg. written Swiss German = de-CH
eg. spoken Swiss German = sde

How do we refer to a language in a non-specific way?

eg. Just French, not France or Canadian French = fr
eg. Catalan, whether it is in France or Spain. = ca

How do we identify a language group, rather than a specific language?

eg. Romance languages = RMN

How do we identify a dialect without being specific about region?

eg. Scouse dialect, wherever ir' is spoke, like = en-z-scouse

Locales

How do we refer to a language that is generalised to fit a geographical area that comprises numerous more detailed locales?

eg. Latin American Spanish = es-LAM

How do we specify a locale that fits a particular group of countries?

eg. French-speaking communities in Belgium and France, but not Canada = fr-FR,BE

How do we specify a geographical area without mentioning language?

eg. voltage settings for Canada = z-CA

History

How do we identify a language variant along historical lines?

eg. Chaucerian English = en//Chaucerian
eg. Ainu in Japan in the 5th century = ain-JP//400CE-500CE
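
Several of the examples above use 'z' where a language or region is left unspecified, and an empty script slot ('//') where only a historical era is given. The following Python sketch shows one way a tag could be assembled from its parts under that reading. The function name build_tag and the constant UNSPECIFIED are hypothetical, and treating 'z' as a general "unspecified" placeholder is an inference from the examples rather than something the proposal defines.

```python
from typing import Optional, Sequence

# Assumption: 'z' stands in for an unspecified language or region,
# as the examples z/LAT, z-CA and en-z-scouse suggest.
UNSPECIFIED = "z"


def build_tag(
    language: Optional[str] = None,
    region: Optional[str] = None,
    dialects: Sequence[str] = (),
    script: Optional[str] = None,
    era: Optional[str] = None,
) -> str:
    """Assemble a tag in the order language, region, dialects, script, era."""
    head = [language or UNSPECIFIED]
    # A dialect needs the region position filled, even if only by a placeholder.
    if region or dialects:
        head.append(region or UNSPECIFIED)
        head.extend(dialects)
    tag = "-".join(head)
    # The script slot is emitted, possibly empty, whenever an era follows it.
    if script or era:
        tag += "/" + (script or "")
    if era:
        tag += "/" + era
    return tag


if __name__ == "__main__":
    print(build_tag("mn", script="CYR"))
    print(build_tag(script="LAT"))
    print(build_tag(region="CA"))
    print(build_tag("en", dialects=["scouse"]))
    print(build_tag("en", era="Chaucerian"))
    print(build_tag("ain", region="JP", era="400CE-500CE"))
```

Keeping every part in a fixed position, with an explicit placeholder or an empty slot when a part is missing, means that the meaning of a subtag can be read from where it sits rather than from what it looks like.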

 

Obviously, if we were to implement something like that described above, we should do so soon if we want to ensure maximum compatibility with current usage of RFC 3066.

Again, my key interest here is to widen the debate about the needs and possible solutions relating to text tagging, not necessarily to propose a particular solution. I hope this is helpful.