Position Statement Regarding I18N issues in the Context of W3C Activity

Peter Constable, SIL International

0. Introduction

One of the most significant characteristics of the development of the World Wide Web is the great extent to which state-of-the-art technologies are available to users who represent an unprecedented level of cultural diversity. Whereas a mere ten years ago many cultures had little access to information technologies, today thousands of distinct cultures have access to the Web. There are still some gaps, however, in the ability of technologies to support cultural diversity in areas such as information content. For example, a given culture may use a writing system that is not well supported by readily-available software agents. Happily, this is increasingly not a problem for major cultures, but there is still a significant gap for thousands of lesser-known cultures.

Certain categories of users also have particular needs that lie beyond those of the majority. In particular, researchers in the humanities and especially in areas related to language -- linguists, philologists and epigraphers -- often need to work with written information that is not yet well supported by technologies. The problems they face largely overlap those of the aforementioned group.

Related to this are some specific technology issues that are of interest to me and to SIL International and that I believe may be appropriate areas for W3C I18N activity. They fall into two broad categories: script-related issues, and issues related to language identification.

1. Script-related Issues

There are two general problem areas related to the Web that are of concern for SIL and that I think are of potential interest for W3C I18N activity. One is related to encoded representation of text information, and the other to rendering of complex script behaviours.

1.1 Issues related to encoded representation

The W3C character model and the Unicode Standard form a solid basis for dealing with problems of character representation. As they stand, however, they offer a solution only to the extent that text can be represented using the current version of the Unicode Standard. In particular, there are many situations in which users need to work with characters or even entire scripts that are not yet part of the Unicode Standard. Unicode provides private-use codepoints that can be used for encoding, but these are really only functional for internal purposes and are not useful for interchange between arbitrary users. Mechanisms are therefore needed to allow such data to be interchanged between arbitrary users. This involves not only a means for representing the non-standard character entities within a document but also a means for documenting their semantics.

Related to this is a problem (especially significant for philologists and epigraphers) of being able to identify not only particular characters but also specific glyph variants. Again, this is a two-fold problem of needing some mechanism for representing the particular entity within a document and some mechanism for documenting the semantics associated with that entity.
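Purely as an illustrative sketch of the kind of mechanism in view here -- the element and attribute names below are invented and do not correspond to any existing recommendation -- a document might carry a not-yet-encoded character as a private-use codepoint, and identify a particular glyph variant of an encoded character, by pointing to external documentation of their semantics:

    <!-- Hypothetical markup: illustrates the kind of mechanism being
         argued for; no such elements exist in any current W3C
         recommendation. -->
    <text>
      <!-- a character not yet in Unicode, carried as a private-use
           codepoint together with a pointer to a description of its
           semantics -->
      <c pua="U+E012" desc="http://example.org/chardb#tone-letter-3"/>
      <!-- an encoded character for which a specific glyph variant of
           epigraphic interest must be identified -->
      <c cp="U+0905" variant="http://example.org/glyphdb#archaic-form-2"/>
    </text>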

Also related is a problem of existing practice and legacy data using non-standard encodings. Over the years, a great many users who have needed to work with writing systems that were not well supported by software agents have resorted to creating their own solutions, typically involving custom fonts that used custom character encodings. In some cases, these solutions were merely addressing a lack of support for certain special characters not found in major encoding standards; e.g. a schwa for phonetic transcriptions. In other cases, these solutions were also addressing a lack of support for complex script-rendering behaviours; e.g. creating presentation-form encodings that included multiple positional variants for diacritics such as tone marks. While the preferred solution is for users to encode data using Unicode and to use software agents that support complex script-rendering behaviours as assumed by the character-glyph model, there is a significant amount of existing legacy data that cannot be quickly converted. Moreover, users are not always yet prepared to give up their legacy solutions, since not all software yet provides adequate support for Unicode and complex script-rendering technologies. Thus, there is a need to provide means for documenting legacy encodings, including those that use characters not yet in Unicode or that use presentation-form encoding methods. The Unicode Consortium has begun to address this need in UTR #22, but that work has not yet advanced sufficiently to provide a means to describe the full variety of legacy encodings in use. Furthermore, such mechanisms for documenting legacy encodings need to be integrated with the aforementioned mechanisms for documenting non-standardised characters and glyphs.
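For illustration, a UTR #22-style mapping description for such a legacy custom-font encoding might look roughly like the following. This is a simplified sketch only: the identifier and byte values are invented, and the actual CharMapML format includes additional elements (such as history and validity information) beyond what is shown here.

    <characterMapping id="x-custom-phonetic-font" version="1">
      <assignments sub="3F">
        <!-- ASCII letters carried through unchanged -->
        <a b="61" u="0061"/>
        <!-- legacy byte 0xD8 carried a schwa in the custom font -->
        <a b="D8" u="0259"/>
        <!-- legacy byte 0xE4 was a presentation form (a raised
             positional variant of a tone mark) with no single
             standard Unicode equivalent, so it is mapped here to a
             private-use codepoint; describing such cases is precisely
             where current mechanisms fall short -->
        <a b="E4" u="E0A4"/>
      </assignments>
    </characterMapping>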

1.2 Issues related to rendering of complex script behaviours

A slowly increasing number of vendors are beginning to provide support for complex script-rendering behaviours to accommodate the needs of scripts such as Arabic and Thai. Some are relying on OpenType / Uniscribe support built into Windows; other efforts are being made to add similar support in other environments, such as Java 2D. One common concern with all of these efforts is that they target major markets and thus provide support for scripts such as Arabic and Thai, but they do not address the needs of smaller markets. Thus, support for scripts such as Khmer and Lao is rare; support for non-standard use of a complex script as might be required by lesser-known languages is non-existent (e.g. Mon-Khmer languages spoken in Thailand would use Thai script but would require diacritic combinations that are considered impossible for Standard Thai and so are not supported in existing implementations).

W3C has introduced a new font format as part of the SVG recommendation. This font format is in need of mechanisms for dealing with complex script-rendering issues just as much as any other font technology, such as TrueType. What is distinctive about SVG fonts, however, is that they are represented entirely as XML documents and so require a means of declarative definition that is amenable to textual representation. This would apply as well to any information added to a font to support complex script-rendering processing. Interestingly, a declarative, textual representation of complex script-rendering behaviours could facilitate implementations that are non-script-specific, meaning that the dichotomy between major scripts and lesser-known scripts would disappear.
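For reference, an SVG font is already a purely declarative XML structure along the following lines (path data elided); what it lacks is any comparably declarative means of expressing contextual glyph substitution, reordering or positioning:

    <font id="ExampleThai" horiz-adv-x="1000">
      <font-face font-family="ExampleThai" units-per-em="1000"/>
      <missing-glyph horiz-adv-x="500"/>
      <glyph unicode="&#x0E01;" glyph-name="ko-kai" d="..."/>
      <glyph unicode="&#x0E48;" glyph-name="mai-ek" d="..."/>
      <!-- There is no declarative way to state, for example, that
           mai-ek should be repositioned or replaced by a variant
           glyph in particular contexts, which is exactly the kind of
           behaviour complex scripts require. -->
    </font>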

To advance I18N issues on the Web, therefore, additional work on SVG is needed to provide support for glyph transformations as required for rendering complex scripts. Most existing technologies (e.g. OpenType) are not especially amenable to textual representation using XML, however. On the other hand, SIL International has developed a complex script-rendering technology known as Graphite that may provide exactly what is needed. Specifically, one of the components of the Graphite technology is the Graphite Description Language (GDL), a high-level language for describing visual script behaviours. GDL uses a pattern match / replacement processing model akin to that used by XSLT. While GDL was not written as an application of XML, the language is certainly amenable to an XML-based representation, and we have begun to create an equivalent description language using XML.
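Purely as a sketch of what an XML representation of such pattern match / replacement rules might look like -- the element names below are invented for illustration and reflect neither actual GDL syntax nor the XML language under development -- a contextual substitution rule could be expressed declaratively along these lines:

    <!-- Invented element names: a pattern match / replacement rule
         in the spirit of GDL or of XSLT templates. -->
    <rendering-rules script="thai">
      <rule>
        <!-- match: a below-base vowel following a consonant that has
             a descender -->
        <match>
          <glyph class="consonant-with-descender"/>
          <glyph class="below-base-vowel"/>
        </match>
        <!-- replace: keep the consonant, substitute a variant of the
             vowel glyph positioned clear of the descender -->
        <replace>
          <glyph ref="1"/>
          <glyph ref="2" variant="lowered"/>
        </replace>
      </rule>
    </rendering-rules>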

2. Issues related to language identification

W3C protocols have relied on the IETF's RFC 3066 and, indirectly, on the ISO 639 family of standards. These have been adequate for many common needs, but users are increasingly encountering their limitations. The sources of limitation are, I believe, threefold:

a) Categories in ISO 639 are not defined on a consistent basis and are poorly documented; hence users often do not know what an identifier is supposed to mean or when it is appropriate to use it.

b) Language coverage falls far short of what users need.

c) "Language" tagging is applied to data in the absense of any ontological model for what kinds of language-related categories information should be labelled. We use terms such as "language" and "locale", but often it is unclear what it is we are trying to qualify about data and, therefore, what the appropriate way is to indicate such qualification. A careful analysis would probably indicate that in some situations users do want to identify language while in other cases they want to specify a writing system or perhaps an orthography. Currently, identifiers for such distinctions are typically coined in ad hoc ways.

Significant improvements with regard to language identification are needed within IT as a whole. Some of these lie beyond the scope of W3C I18N activity (e.g. addressing problems of documentation in ISO 639). On the other hand, it may well be in scope to address the lack of an ontological model for language-related categories. This would entail a document that provides operational definitions for terms such as "language", "writing system" and "locale", that clarifies the relationships between them, and that clarifies under what circumstances identification in relation to each of these kinds of category may be appropriate or necessary.
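To make the distinction concrete, such a model might eventually allow these dimensions to be qualified separately rather than packed into a single opaque tag. Purely as an invented sketch (the namespace, attributes and values below are hypothetical):

    <!-- Hypothetical markup: language, writing system and orthography
         identified as distinct but related kinds of category. -->
    <text xmlns:ws="http://example.org/writing-system"
          xml:lang="th"
          ws:script="Thai"
          ws:orthography="x-example-orthography">
      ...
    </text>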