WInter Group                                                         M.T. Carrasco Benitez
{NOT} INTERNET-DRAFT
<draft-ietf-winter-01.html> 
Expires {month} {day}, 1996                                          {month} {day}, 1996

WInter (Web Internationalization & Multilinguism)

Status of this Memo

This document is {NOT} an Internet-Draft. Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress".

To learn the current status of any Internet-Draft, please check the "1id-abstracts.txt" listing contained in the Internet-Drafts Shadow Directories on ftp.is.co.za (Africa), nic.nordu.net (Europe), munnari.oz.au (Pacific Rim), ds.internic.net (US East Coast), or ftp.isi.edu (US West Coast).

Distribution of this document is unlimited. Please send comments to the WInter group at <winter@dorado.crpht.lu>. Information about the WInter group, including subscription details, is on the WInter Page at:
http://www.crpht.lu/~carrasco/winter

Abstract

This document discusses the Internationalization & Multilinguism of the Web: a Web capable of supporting different cultures, natural languages and Language Engineering facilities such as Parallel Texts. Internationalization permeates most subsystems: client, transmission, server, data and authoring; the primitive mechanisms for WIntering should be part of the Web foundations.

Table of Contents

Introduction
Terminology
{needs work}
{numbering later}

Introduction

An Internationalized & Multilingual Web should have the traditional facilities of Internationalization and the more advanced facilities needed for Language Engineering. For example, clients should have a language menu (similar to the Edit or File menu in Windows) that shows in which other linguistic versions the currently displayed document is available; or clients should be capable of displaying two linguistic versions of the same document side by side and moving through them in sync.

The intention of this document is to consider all aspects of WIntering. To a very large extent, it puts together the efforts of other groups. It goes into more detail when material is not covered elsewhere.

"Another noteworthy characteristic of this manual is that it doesn't always tell the truth. When certain concepts of TeX are introduced informally, general rules will be stated; afterwards you will find that the rules aren't strictly true."
The TeXbook
Donald E. Knuth
The above quote applies particularly to the other documents. Though the intention is to make this document self-contained by summarizing or quoting other documents, it is strongly recommended to consult the source documents.

Writing Style

A special effort should be made to make this document as accessible as possible to non-computer specialists (e.g., linguists) and non-native English speakers. Due to the characteristics of WInter, there should be a significant number of both. This does not imply that there should be one type of document for each type of participant; it means that this document should be accessible to all participants, perhaps by adopting a journalistic style and restating the evident. The overhead should not be too big, and it is good to avoid misunderstandings, even between people of the same field.

Comments regarding the writing style from journalists or readers with similar profiles are very welcome; i.e., people who are not computer specialists but have to explain computer material to other non-specialists. Suggestions could cover what additional material should be included to make this document more self-contained, and what terms should be replaced to make it more accessible. However, the gory normative details must be present.

Terminology

  • Alignness
    Alignness is a quality of Parallel Texts; for example, the English and Spanish versions of the Treaty of Rome are Parallel Texts and they should be aligned. The interesting part is aligning Parallel Texts automatically.

  • Author-Translator-Publisher Chain (ATP-chain)
    It refers to the integration of all the phases in the production of documents. ATP-chains are usually large distributed systems.

  • Language Engineering
    Language Engineering is the application of informatics to natural languages. In particular:

    • Terminology
    • Translator's Memory
    • Multilingual documentary databases
    • Aligned Text
    • Translator's Workbench
    • Author's Workbench
    • Machine Translation
    • Publishing (in particular, multilingual synchronized publishing)

  • Level of Alignness
    Depending on the depth to which the Linguistic Objects can be identified, the texts are aligned at:
    • Document level: the trivial case; i.e., Parallel Texts.
    • Paragraph level: not too hard.
    • Sentence level: desirable and possible.
    • Term level: it needs tagging.
    • Token level: it needs tagging.

    In this context, a sentence is a part of a text delimited by a dot, semicolon or similar; i.e., it has little grammatical meaning and the main interest is to identify Linguistic Objects.

  • Linguistic Object
    A Linguistic Object is a unit of language representation. It can be a fixed language representation (term, abbreviation, title, segment, phrase, paragraph, etc.) or a meta-language representation (a grammatical construction, etc.). More generally, a Linguistic Object is a discrete linguistic unit (usually a string) whose meaning is created by the program treating it.

  • Multilingual Aligned Text (MAT)
    A MAT is a record with one Linguistic Object per language field (English, Spanish, German, etc.); the Linguistic Objects are equivalents (usually translations) of each other. There are other fields for classification and other purposes. MATs constitute independent elements of a table; i.e., there is no ordering in the table. The end result is a data structure similar to a multilingual dictionary.

  • Parallel Texts
    Parallel Texts are linguistic versions of the same text; for example, the English and Spanish versions of the Treaty of Rome are Parallel Texts.

  • WInter
    It stands for Web Internationalization & Multilinguism.

  • Unicode Glossary
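
    The Multilingual Aligned Text (MAT) defined in the list above can be pictured as a small record type. The following is only a minimal sketch in Python under stated assumptions: the class name MAT, the function translate, the "extra" field and the sample entries are illustrative and are not defined by WInter.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class MAT:
    """A Multilingual Aligned Text: one Linguistic Object per language field."""
    texts: Dict[str, str]                                 # language tag -> Linguistic Object
    extra: Dict[str, str] = field(default_factory=dict)   # classification and other fields (assumed shape)


def translate(table: List[MAT], source: str, target: str, text: str) -> Optional[str]:
    """Return the target-language equivalent of an exact source-language match."""
    for mat in table:  # the table has no ordering, so a plain scan is enough for a sketch
        if mat.texts.get(source) == text:
            return mat.texts.get(target)
    return None


# Illustrative sample data only.
table = [
    MAT({"en": "Treaty of Rome", "es": "Tratado de Roma"}, {"status": "verified"}),
    MAT({"en": "Official Journal", "es": "Diario Oficial", "de": "Amtsblatt"}),
]

print(translate(table, "en", "es", "Treaty of Rome"))
```

    Used this way, the table behaves like the multilingual dictionary mentioned in the MAT definition: any language field can serve as the source and any other as the target.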

    Character Set

    Preliminaries

    A large character set is a basic prerequisite for having Internationalization & Multilinguism. The bottom line is that the Web must be capable of handling Unicode [UNICODE].

    The character set should be considered a low-level layer; i.e., like the pieces of wire in the seven-layer model. Other functionality should be built on top. There is a tendency to overload this layer instead of defining new layers.

    There are two aspects to the character set:

  • The Back office
    It deals with storage on disk, transmission, representation in the document, etc.

  • The Front office
    It is concerned with rendering on the screen or printer.

    Back office

    Latin-1 (ISO 8859-1) [ISO-8859-1] is the default character set for the Web. Latin-1 is only sufficient for Western European languages. Latin-1 is an 8-bit encoding. This permits a maximum of 256 characters.

    Unicode (ISO 10646 BMP) is a large character set that includes most of the world's languages. Unicode is a 16-bit encoding. This permits over 65,000 characters. At present, over 25,000 positions are still free. This form is also called UCS-2; i.e., Universal Character Set 2-bytes. Unicode is the first plane of ISO 10646 (see below); this plane is also called BMP (Basic Multilingual Plane) or Plane Zero.

    ISO 10646 is a 32-bit encoding, of which 31 bits are used. It is divided into 32,768 planes, each with a capacity of 65,536 characters. This permits about 2,147 million characters. This form is also called UCS-4; i.e., Universal Character Set 4-bytes. Only the first plane (Unicode) is in use.

    UTF-8 (Universal Character Set Transformation Format, 8-bit) is part of ISO 10646. In brief, it could be considered a "compression" mechanism: it allows the 32-bit encoding of ISO 10646 to be used while storing or transmitting only the bytes needed. For example, an ASCII character is represented by 1 byte (8 bits) and not 4 bytes (32 bits). There are additional advantages.
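
    The difference in size between the encodings described above can be checked with a few lines of Python. This is only an illustration: the function name encoded_sizes is hypothetical, and Python's "utf-16-be" and "utf-32-be" codecs are used as stand-ins for UCS-2 and UCS-4, which they match for characters of the Basic Multilingual Plane.

```python
def encoded_sizes(char: str) -> tuple:
    """Return the size in bytes of one character as (UTF-8, UCS-2, UCS-4)."""
    return (
        len(char.encode("utf-8")),
        len(char.encode("utf-16-be")),  # same bytes as UCS-2 for BMP characters
        len(char.encode("utf-32-be")),  # same bytes as UCS-4
    )


# A, e with acute accent, Greek alpha, and a CJK ideograph.
for char in ["A", "\u00e9", "\u03b1", "\u4e2d"]:
    utf8, ucs2, ucs4 = encoded_sizes(char)
    print(f"U+{ord(char):04X}  UTF-8: {utf8}  UCS-2: {ucs2}  UCS-4: {ucs4}")
```

    The ASCII letter needs a single byte in UTF-8, while the other characters need two or three; in UCS-2 and UCS-4 every one of these characters occupies two and four bytes respectively.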

    The Internationalization of the Hypertext Markup Language (I-HTML) [I-HTML] proposes Unicode as the document character set.

    HTTP-1.1 [HTTP-1.1] allows for the character set to be negotiated. For example, the client and server can agree on using Unicode.

    Front office

    Rendering is drawing the glyph (graphic representation of the character) on the screen or printer. This is the job of the browser and the browser depends on the graphical facilities of the computer.

    Undisplayable characters are the characters that cannot be displayed due to the lack of facilities. The I-HTML "does not prescribe any specific behavior", but notes some "considerations". WInter recommends the following:

    The guidelines for the behavior with undisplayable characters should be further refined. Font servers could supply the browser with missing glyphs, perhaps combined with Millicent-type [MILLICENT] payments.

    Internationalization & Localization

    Internationalized software is developed without cultural characteristics embedded. It can be localized parametrically for different cultures; for example, the same software can run in Germany with the German conventions, or in Italy with the Italian conventions.

    Internationalization is a well-known field. For example, a significant amount of work was done during the POSIX standardization. Most conventions have already been agreed; e.g., how to represent the date in Germany.

    Any number of cultures (real or imaginary) is possible; for example, France, Germany or the European Commission. The European Commission has to work in the eleven official languages (including Greek), and with either cross-cultural conventions or the national conventions.

    The elements of localization

  • Languages
    Two aspects:
    • Language strings in the software.
    • Data in the document.
    Example, the software could be in German and the document shown in French.

  • Sorting order

  • Number representation
    Example, the internal number could be 12345.67 and the external representation could be 12,345.67 or 12.345,67.

  • Date & Time
    Example, the internal representation could be 19951231 and the external representation could be December 31st 1995, or 31-12-1995.

  • Short quotations
    Example,
    • "I am a Berliner" (English)
    • <<Je suis un Berlinois>> (French)
    • ,,Ich bin ein Berliner'' (German)
    The new element <Q> in I-HTML is for this purpose.
  • New internationalization elements should be added to this list, for example, color.

    The software should be localized from a list of preferred localizations, and should be switchable from one localization to another without restarting the application.
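
    The elements of localization listed above can be sketched as a table of conventions that is selected at run time. This is a hedged illustration only: the CONVENTIONS table, its keys, the culture names and all function names are assumptions made for the example, and a real system would take its conventions from locale data rather than from a hand-written table.

```python
from datetime import datetime

# Hypothetical convention table (illustrative values only).
CONVENTIONS = {
    "en-US": {"thousands": ",", "decimal": ".", "date": "%m/%d/%Y", "quotes": ('"', '"')},
    "de-DE": {"thousands": ".", "decimal": ",", "date": "%d.%m.%Y", "quotes": (",,", "''")},
}


def pick_culture(preferred):
    """Return the first preferred culture that is supported, else None."""
    return next((c for c in preferred if c in CONVENTIONS), None)


def format_number(value, culture):
    """Render an internal number with the separators of the given culture."""
    conv = CONVENTIONS[culture]
    whole, fraction = f"{value:,.2f}".split(".")
    return whole.replace(",", conv["thousands"]) + conv["decimal"] + fraction


def format_date(internal, culture):
    """Render an internal YYYYMMDD date with the date pattern of the culture."""
    return datetime.strptime(internal, "%Y%m%d").strftime(CONVENTIONS[culture]["date"])


def quote(text, culture):
    """Wrap a short quotation in the quotation marks of the culture."""
    opening, closing = CONVENTIONS[culture]["quotes"]
    return opening + text + closing


culture = pick_culture(["fr-FR", "de-DE", "en-US"])  # fr-FR is not in the table
print(culture, format_number(12345.67, culture), format_date("19951231", culture))
culture = "en-US"  # switching needs no restart: the culture is only a parameter
print(culture, format_number(12345.67, culture), format_date("19951231", culture))
```

    Keeping the culture as a parameter, rather than as global state fixed at start-up, is what makes switching between localizations possible while the application is running.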

    {needs work; BIDI in I-HTML}

    Multilinguism

    Multilinguism deals with advanced language facilities, often involving several languages simultaneously. It is also referred to as Language Engineering. This comes from the tradition of specialized software for Language Engineering, such as the Translator's Workbench. One of the main applications is the processing of Parallel Texts.

    Most software in Language Engineering is incompatible and there are practically no standards in this field. Usually, researchers or vendors start from scratch and develop all the modules, even horizontal modules such as user interfaces, data structures and storage, rather than concentrating on the engines for language processing (for aiding the translator, machine translation, etc.).

    One of the main immediate objectives in Language Engineering must be the creation of standards that clearly separate data and software; i.e., it should be possible to acquire a translation aid from one vendor and the dictionaries from another vendor.

    The purpose is not to make every browser a Translator's Workbench, though browsers could do with more advanced language facilities than are usually found in internationalized products. But the standards must allow the construction of Translator's Workbenches based on Web technology.

    After security and secure payment over the Internet, Language Engineering is one of the most economically relevant applications; in intranets, with fewer security requirements, it is probably the most important. It is as horizontal as publishing and, indeed, it is the second phase in the ATP-chain (Author-Translator-Publisher). Translating is expensive and very labor-intensive. For most texts, machine translation is not acceptable. On the other hand, translation aids are very cost-effective, particularly if integrated in an ATP-chain, and the savings in translation tend to be large.

    Parallel Hypertext

    Parallel Hypertext is an extension of the hypertext paradigm to natural languages. For example, a user looking at a document in English should be able to obtain the Spanish translation in a transparent way; i.e., just by selecting the Spanish option in a language menu and not by selecting an anchor embedded in the English version. For this, the Web must know about languages; i.e., about the notion of "the same document in another language". The same property of alignness found in Parallel Texts applies to Parallel Hypertext.

    Language Tags

    The language tags (see 3.10, HTTP-1.1) are composed of a primary language tag followed by zero or more subtags.
    Examples:
    
    en
    en-US
    en-cockney
    
    There must be a way to indicate This could be part of a subtag or inside the document. This has to be refined.
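
    As an illustration of the tag structure described in this section, a language tag can be split into its primary tag and its subtags. This is a minimal sketch: the function name parse_language_tag is hypothetical, and lowercasing the tag is an assumption made here so that tags can be compared without regard to case.

```python
def parse_language_tag(tag):
    """Split a language tag into (primary tag, list of subtags), lowercased."""
    primary, *subtags = tag.strip().lower().split("-")
    return primary, subtags


for tag in ["en", "en-US", "en-cockney"]:
    print(tag, parse_language_tag(tag))
```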

    Document request

    Clients should be able to request documents at least in the following ways:
    1. A document is requested according to a language preference list, which could be the same list used for choosing the display labels in the user interface. The server must respond with the best linguistic version and the list of available linguistic versions. The best linguistic version is the available one nearest to the top of the list; if none is available, it is the one nearest to the top of the server's defaults. In this case, the browser probably does not know which linguistic versions are available.
    2. A document is requested in one specific language. The server must respond only with that linguistic version (no other is acceptable) and the list of available linguistic versions. In this case, the client probably knows that the requested version is available; it could be the result of a previous conversation with the server.
    Example: The linguistic versions of the document could be on different servers.

    This could be done with the Accept-Language and Content-Language facilities (see 10.4 and 10.11, HTTP-1.1).
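
    The two kinds of request described above can be sketched as a single selection function. This is only a sketch under stated assumptions: the function name select_version and its parameters are hypothetical, languages are compared as plain lowercase strings, and a request list with exactly one language is treated as the second kind of request (only that version is acceptable).

```python
from typing import List, Optional, Tuple


def select_version(requested: List[str], available: List[str],
                   server_defaults: List[str]) -> Tuple[Optional[str], List[str]]:
    """Return (chosen linguistic version or None, list of available versions)."""
    have = {lang.lower() for lang in available}
    listing = sorted(have)
    # The best version is the available one nearest to the top of the request list.
    for lang in requested:
        if lang.lower() in have:
            return lang.lower(), listing
    # A single requested language means that no other version is acceptable.
    if len(requested) == 1:
        return None, listing
    # Otherwise fall back to the version nearest to the top of the server defaults.
    for lang in server_defaults:
        if lang.lower() in have:
            return lang.lower(), listing
    return None, listing


available = ["en", "es", "de"]
defaults = ["en", "fr"]
print(select_version(["fr", "es", "en"], available, defaults))  # preference list
print(select_version(["it", "pt"], available, defaults))        # falls back to defaults
print(select_version(["fr"], available, defaults))              # specific language, not available
```

    In every case the list of available versions is returned together with the choice, so that a client could use it to fill its language menu.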

    The parameters in Accept-Language:

  • Quality factor "q" is described as an "...estimate of the user's comprehension of that language". But the user already indicates a language preference list, so there is no need to use the parameter with this meaning. It would be more useful to indicate the minimum acceptable quality of the translation: some translations could be done by more or less experienced translators, or by machine translation.

    A different usage could be to indicate the level of alignness.

  • Maximum acceptable size "mxb" is not used. It could indicate the number of linguistic versions desired.

    An Accept-Language with a single language parameter must mean that the browser only wants that linguistic version and not another.

    The Content-Language "...describes the natural language(s) of the intended audience ...". The meaning of this field should be "the list of linguistic versions available"; it should be used by the browser to update the language menu, so that the user knows which other linguistic versions are available.

    Parallel Hypertext Data Structure (PHDS)

    One Parallel Hypertext Data Structure contains all the information for one Parallel Hypertext Document. The Parallel Hypertext Data Structure must allow the following:

    The Parallel Hypertext Data Structure has two parts:

  • The PHDS-Header
    Contains administrative data; for example, where the German linguistic version is located. The data is divided into structured fields.
  • The PHDS-Body
    Contains the linguistic data. It has one section per language.

    The PHDS-Header is an HTML file. This file must fulfill two functions:

    The PHDS-Header must contain at least the following information: {needs work; give an example of a file in HTML}

    The default for a single set of files is:

    DocName.html                                     (PHDS-Header)
    
    DocNameDir                                       (PHDS-Body, a directory)
               /en.html                   English    (PHDS-Body language section)
               /es.html                   Spanish    (PHDS-Body language section)
               /de.html                   German     (PHDS-Body language section)
    
    
    The default for several sets of files is:
    
    DocName.html                                     (PHDS-Header)
    
    DocNameDir                                       (PHDS-Body, a directory)
                    /en/DocName1.html      English   (PHDS-Body language section)
                    /en/DocName2.html      English   (PHDS-Body language section)
    
                    /es/DocName1.html      Spanish   (PHDS-Body language section)
                    /es/DocName2.html      Spanish   (PHDS-Body language section)
    
                    /de/DocName1.html      German    (PHDS-Body language section)
                    /de/DocName2.html      German    (PHDS-Body language section)
    
    The DocName.html should be usable directly by the present clients (browsers) and/or indirectly to generate HTML files on the fly. Multilingual clients should use the information to access the documents in a transparent way.

    Requesting a URL of a PHDS-Header must get the linguistic version according to the rules of the language preferences. Requesting a URL of a PHDS-Body language section must get that linguistic version.
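
    The default file layouts shown above can be expressed as two small path-building functions. This is a sketch only: the function names header_path and body_path are hypothetical, and it is assumed that "DocNameDir" stands for the document name followed by the literal suffix "Dir".

```python
from typing import Optional


def header_path(doc_name: str) -> str:
    """Path of the PHDS-Header for a document."""
    return f"{doc_name}.html"


def body_path(doc_name: str, language: str, part: Optional[str] = None) -> str:
    """Path of a PHDS-Body language section.

    With no part, the layout for a single set of files is used;
    with a part name, the layout for several sets of files is used.
    """
    if part is None:
        return f"{doc_name}Dir/{language}.html"
    return f"{doc_name}Dir/{language}/{part}.html"


print(header_path("DocName"))
print(body_path("DocName", "en"))
print(body_path("DocName", "es", "DocName2"))
```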

    The server must know at least the following defaults:

    {needs work}

    A standard data structure for Parallel Hypertext would be of use to anybody working with Parallel Texts, regardless of whether the Web is used. For example, CD-ROMs could be published with Parallel Texts for language processing programs, such as Machine Translation, that would know what to expect. At present, there is no standard for Parallel Texts or MAT.

    {needs work; relation to TEI}

    Anchoring Strategy

    The anchoring strategy must minimize maintenance. This is essential for large multilingual documentary databases; for example, the millions of pages of the European Institutions in eleven languages. Only one linguistic version should have explicit anchors; i.e., the anchors as used today, which are physically present in the documents. The other linguistic versions would have implicit anchors; i.e., anchors that are not physically present in the texts but can be calculated from the alignness of the different linguistic versions.

    The generation of implicit anchors could be a client, server and/or authoring affair:

    These options should be considered as a continuum and (some) are not mutually exclusive: most degrees between the extremes are possible. For example, servers could be able to create documents on the fly and they could be using documents with the anchors generated by authoring systems. Indeed, a mixture could be the most probable case.

    The level of alignness should be calculated in advance and kept in the Parallel Hypertext Data Structure. I verified documents widely regarded as aligned because they had been revised over half a dozen times and heavily used for decades (best-case documents); once submitted to a computer program, it came to light that they were not aligned even at paragraph level.

    The anchored text (i.e., what goes between <a ...> and </a>) would have to be at least at the level at which the texts are aligned. For example, for texts aligned only at paragraph level, it is not possible to calculate implicit anchors at sentence level. A corollary is that texts aligned at document level can have implicit anchors only at the beginning or at the end.

    The anchors would have to be at least at sentence level. It would be hard to place implicit anchors in part of a sentence without tagging: the second text should have null anchors; named null anchors if there are several in one sentence.
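
    The calculation of an implicit anchor at sentence level can be sketched as follows. This is a deliberately naive illustration, not a definitive method: the function names are hypothetical, sentences are delimited by a dot, semicolon or similar as in the Terminology section, and two texts are assumed to be aligned at sentence level when they have the same number of sentences, with the n-th sentence of one corresponding to the n-th sentence of the other. Real alignment would need a far more careful check.

```python
import re
from typing import List, Optional


def split_sentences(text: str) -> List[str]:
    """Split a text after a dot, semicolon or similar delimiter."""
    parts = re.split(r"(?<=[.;!?])\s+", text.strip())
    return [part for part in parts if part]


def implicit_anchor(explicit_text: str, other_text: str, sentence_index: int) -> Optional[str]:
    """Return the sentence of other_text that carries the implicit anchor.

    sentence_index is the position of the explicitly anchored sentence in
    explicit_text. None is returned when the texts are not aligned at
    sentence level (by the naive count-based assumption described above).
    """
    source = split_sentences(explicit_text)
    target = split_sentences(other_text)
    if len(source) != len(target) or not 0 <= sentence_index < len(source):
        return None
    return target[sentence_index]


english = "The Treaty has two parts. The first is short."
spanish = "El Tratado tiene dos partes. La primera es breve."
print(implicit_anchor(english, spanish, 1))
```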

    Examples:

    Generation of parallel texts

    The linguistic versions could be generated through machine translation or other techniques. For example, a system could have documents in Spanish and a program for translation to English. The user should be informed by the language menu into which language and with which technique the documents are available.

    {needs work}

    Dragoman

    This section is included mostly to illustrate the kind of applications for multilinguism.

    Dragoman is a reference model for Language Engineering. It uses the Multilingual Aligned Hypertext technique. In essence, Dragoman describes a Database (part structured and part documentary) and Services that can be implemented over a (multilingual) Database. The Web paradigm is particularly well adapted to Dragoman. The term Dragoman has nothing to do with dragons; it means language interpreter.

    What follows is a very brief description of some of the Services that could be implemented over the Database. There could be several programs offering the same Service. Services processing whole documents could be implemented in batch; particularly if they are using a very large Database (several gigabytes).

    Interactive Search

    Selects the Multilingual Aligned Texts that match a search criterion. The search is fuzzy (e.g., 87% match). Unfound requests are valuable information that must be processed further. The system must keep track of the unfound requests to put people with similar needs in contact (matchmaker); the user must decide what is a typing error and what is a genuine unfound request. The user can also send messages to terminologists (demand-driven terminology).
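
    A fuzzy search of this kind can be sketched with the similarity ratio of Python's standard difflib module. This is a stand-in chosen for the illustration, not the measure used by Dragoman; the function name interactive_search, the representation of a MAT as a dictionary and the sample data are likewise assumptions. In this simplified sketch every request without a match is logged, whereas the text above leaves it to the user to separate typing errors from genuine unfound requests.

```python
from difflib import SequenceMatcher

unfound_requests = []  # kept so that people with similar needs can be put in contact


def interactive_search(query, table, language, threshold=0.87):
    """Return (score, MAT) pairs whose text in `language` matches the query fuzzily."""
    hits = []
    for mat in table:
        candidate = mat.get(language)
        if candidate is None:
            continue
        score = SequenceMatcher(None, query.lower(), candidate.lower()).ratio()
        if score >= threshold:
            hits.append((score, mat))
    if not hits:
        unfound_requests.append(query)
    # Best matches first; sort on the score only, since records are not comparable.
    return sorted(hits, key=lambda hit: hit[0], reverse=True)


# Illustrative sample data only.
table = [
    {"en": "Treaty of Rome", "es": "Tratado de Roma"},
    {"en": "Official Journal", "es": "Diario Oficial"},
]

print(interactive_search("Treaty of Rom", table, "en"))  # a near match is still found
print(interactive_search("budget", table, "en"))         # nothing is found
print(unfound_requests)
```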

    The Translation Folder (full preprocessing)

    The objective is to obtain a complete Translation Folder for a given document. Hence, the translator should not need to consult dictionaries, databases, glossaries, nomenclature list, etc. It is like having a hundred assistants preparing the text for the translator. In a typical Translation Folder, some paragraphs should be fully translated and some paragraphs should be a mixture of full sentences, segments, titles, terms, nomenclatures, etc (all these items are packaged as Linguistic Objects); background documents could also be taken into account. The Linguistic Objects are marked with the Status; for example, unverified, verified, compulsory, etc. The search follows a fuzzy biggest chunk heuristic. Traditionally there are two texts, source and target. But there could be any number of language fields. This could be the most useful Service for the translator and it should be implemented early. The translator could use the result on paper or on the screen.

    Preprocessing for Machine Translation

    Similar to the Translation Folder. It should be adapted to an (existing) machine translation program that follows up the processing. For example, select only exact matches (no fuzzy) and terms in the unfound phrases; the machine translation program would translate only the unfound phrases.

    Machine Translation

    A Machine Translation program that uses the Database directly. For example, a program could combine perfect matches, process the easy fuzzy matches such as dates, pure Machine Translation, etc.

    Pseudo-Automatic Translation (PAT)

    Similar to the Translation Folder, but where all the texts are found with a 100% match (no fuzzy search). The program should be restricted to a collection of records; i.e., it should not be allowed to roam the database as there could be bad surprises. In particular, one must avoid word by word translation; hence one must be very careful with small Multilingual Aligned Texts (for example, a one-word Multilingual Aligned Text).
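
    The restrictions described for Pseudo-Automatic Translation can be sketched as an exact-match lookup over a given collection. This is only an illustration under stated assumptions: the function name, the representation of a MAT as a dictionary, and the min_words parameter (a simple guard against very small Multilingual Aligned Texts) are all hypothetical.

```python
def pseudo_automatic_translation(sentences, collection, source, target, min_words=2):
    """Translate each sentence by a 100% match within the given collection only.

    Sentences with no exact match, or shorter than min_words, yield None.
    """
    translated = []
    for sentence in sentences:
        result = None
        if len(sentence.split()) >= min_words:   # avoid word-by-word translation
            for mat in collection:               # only the given collection is searched
                if mat.get(source) == sentence:  # exact match, no fuzzy search
                    result = mat.get(target)
                    break
        translated.append(result)
    return translated


# Illustrative sample data only.
collection = [
    {"en": "Treaty of Rome", "es": "Tratado de Roma"},
    {"en": "Treaty", "es": "Tratado"},
]

print(pseudo_automatic_translation(["Treaty of Rome", "Treaty", "Treaty of Paris"],
                                   collection, "en", "es"))
```

    The one-word entry is present in the collection but is not used, and the sentence without an exact match is left untranslated rather than approximated.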

    {needs work; terminological databases} {needs work; multilingual thesauri for searching}

    Document Generation

    All the linguistic versions of a document are generated camera ready. There is no source and translation as such, the index is created, the typesetting (nearly) done. This is the most useful Service for the Organization. It is a very efficient way to produce documents. The three phases Author-Translator-Publisher (ATP-chain) are highly integrated. It is particularly adapted to periodic publications. The production of standardized documents is trivial.

    Documents in several linguistic versions are often required to be synchronized; i.e., each page in each linguistic version must contain the same content and the same lay-out (text, number of paragraphs, etc). The typesetting, including the synchronization, must be automated and each page should not be processed by a human; a human operator should intervene only to fine-tune the publication. TeX should be considered.

    A document might need several representations; for example, typeset for the Official Journal and formatted for a CD-ROM. First, a document in SGML should be generated; indeed, the SGML document is the document. All the following representations should be created from the SGML document. This method should guarantee that all the representations have the same content.

    With such a system in place, the creation of secondary products is easy. For example, a Parliamentary Commission could work with a draft of the Budget typeset like the Official Journal, in all the linguistic versions, enriched with hidden comments.

    Document Comparison

    The user directs the program to a document similar to the one that has to be translated. The new pieces could be fetched in the Database. This program could work without the Database, though the new pieces would not be fetched. Similar translations could arise as a version of a previous document and as a new similar document.

    Author's Workbench

    Authors could use a similar technique to Translation Folder and Document Comparison. The unknown parts of the text would be marked and in certain cases alternatives would be proposed. Texts created with the translation phase in mind are easier to translate. Ideally, the author should aim to produce a text for translation with Pseudo-Automatic Translation.

    Terminology Verification

    The objective is to verify the Consistency and Harmonization of the terminology. The concepts are closely related and they can be combined, but they are not the same.

    Multilingual Aligned Text Editor

    An editor that shows at least two (aligned) texts, moves the texts in sync, highlights the differences, etc.

    Printing

    A program that prints one or several Multilingual Aligned Texts side by side. It could be the following step after the Translation Folder. Multilingual Aligned Texts (source and target) on paper allow the translator to use traditional tools such as dictating.

    Acknowledgments

    This document makes heavy use of the documents cited in the text, particularly the relevant RFCs and IETF drafts.

    Also from the following:

    In such fluid circumstances, it is nearly impossible to attribute credit. The following people particularly come to mind:
    Martin Bryan
    Martin Dürst
    Albert Lunde
    Larry Masinter
    Gavin Nicol
    Christine Stark
    François Yergeau
    

    Bibliography

    [MILLICENT] S. Glassman, M. Manasse, M. Abadí, P. Gauthier, P. Sobalvarro, "The Millicent Protocol for Inexpensive Electronic Commerce", Fourth International WWW Conference, Boston, December 1995.

    [HTTP-1.1] R.T. Fielding, H. Frystyk Nielsen, and T. Berners-Lee, "Hypertext Transfer Protocol -- HTTP/1.1", Work in progress (draft-ietf-http-v11-spec-01.txt), MIT/LCS, January 1996.

    [I-HTML] F. Yergeau, G. Nicol, G. Adams, M. Duerst, "Internationalization of the Hypertext Markup Language", Work in progress (draft-ietf-html-i18n-04.txt).

    [ISO-8859-1] ISO 8859-1:1987. International Standard -- Information Processing -- 8-bit Single-Byte Coded Graphic Character Sets -- Part 1: Latin Alphabet No. 1.

    [UNICODE] The Unicode Consortium, "The Unicode Standard -- Worldwide Character Encoding -- Version 1.0", Addison-Wesley, Volume 1, 1991, Volume 2, 1992.

    {needs work}

    Author Address

    Manuel Tomas CARRASCO BENITEZ
    carrasco@innet.lu
    Fax +352 467302

    {Copyright M.T. Carrasco Benitez 1996}