François Yergeau
Alis Technologies inc.
fyergeau@alis.com
The World Wide Web, despite its name, was not originally conceived as a system able to encompass the whole world; the main limitation is, of course, in the area of text representation.
HTML, the native text format of the Web, is based on a very limited coded character set: ISO Latin-1. URLs, the addresses of the Web, are restricted to even less than full ASCII.
But even a quick survey of today's Web finds dozens of languages, many using scripts not representable in standard HTML; this is achieved through a variety of kludges, hacks and semi-compatible tricks, endangering interoperability. Unicode, in contrast, offers a clean, more compatible and scalable solution to the problem of representing all of the world's languages.
This paper will describe current efforts to introduce Unicode on the WWW, including standardization and more practical issues related to fostering multilingualism on the Web.
HTML is the lingua franca of the WWW. A large proportion of documents served are marked up in HTML, and the language serves as a platform for linking in other media types. Thus, it is vitally important for a truly world-wide Web that HTML be usable in any language, or even mix of languages.
To that end, discussions started in late 1994 in the pertinent IETF working group to extend HTML beyond its original limitation to the ISO 8859-1 character set. This brainstorming led first to a seminal paper by Gavin Nicol, addressing many Web i18n issues, and later to an Internet-Draft titled Internationalisation of the Hypertext Markup Language that crystallized the consensus achieved within the working group.
The main thrust of that effort is the adoption of Unicode, the Universal Character Set, as the document character set for HTML, in place of ISO 8859-1 (a.k.a. Latin-1). In a sense, this is only a partial solution, as the markup itself remains restricted to ASCII; this can be contrasted to proposals such as ERCS, in which meaningful markup can be made up in any Unicode-supported language. But the adoption of Unicode as the document character set makes possible the use of any character set that is a Unicode subset, both in text data and in attribute value literals.
One early benefit of this adoption is that numerical character references (of the &#233; variety) become unambiguous, whatever the character encoding of a document, and furthermore remain valid upon character encoding conversion (transcoding): they always refer to Unicode code points.
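The point can be illustrated with a short sketch (Python here purely for illustration; the example strings are invented):

```python
# Sketch: a numeric character reference always denotes a Unicode code
# point, independently of the byte encoding of the surrounding document.
from html import unescape

ncr = "&#233;"                      # U+00E9, LATIN SMALL LETTER E WITH ACUTE
assert unescape(ncr) == "\u00e9"

# The reference itself is plain ASCII, so it survives transcoding of the
# document around it unchanged:
for encoding in ("iso-8859-1", "utf-8", "shift_jis"):
    doc = ("<p>caf" + ncr + "</p>").encode(encoding)
    assert ncr in doc.decode(encoding)
```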
The HTML i18n Internet-Draft contains other goodies related to
internationalisation, past the mere encoding of text. One is the introduction
of a language attribute (LANG) on most elements, identifying the
natural language of the element contents. Language tags can be used to control
rendering of a marked up document in various ways: character disambiguation, in
cases where the character encoding is not sufficient to resolve to a specific
glyph; quotation marks; hyphenation; ligatures; spacing; voice synthesis; etc.
Independently of rendering issues, language markup is useful as content markup
for purposes such as classification and language-sensitive searching. The
latter becomes more and more important as the Web spreads around the world.
Even anglophones will soon find their query results unacceptably polluted by
references to pages in foreign languages. That has long since been the
situation for non-anglophones; imagine a teacher launching a search on Paris,
in the hope of finding documents of interest to his German-speaking pupils: the
goodies are buried in tons of chaff!
Various other i18n features can be envisioned that were not put in the draft, or were rejected as premature. One can think of a much larger set of entity references than the current Latin-1 set, encompassing for instance all the sets standardized in SGML. Markup enabling locale-sensitive rendering and/or form input of dates, times, monetary amounts and similar values would also be valuable. The latter would help prevent one from buying a plane ticket for 08/02/96 when one means 02/08/96, as nearly happened recently to this author.
URLs are a sore point with respect to multilingual support on the Web. These constructs are used to name and locate Internet resources, but are strictly limited to ASCII characters. The defining document, RFC 1738, is very clear about that:
URLs are sequences of characters, i.e., letters, digits, and special characters... In most URL schemes, the sequences of characters in different parts of a URL are used to represent sequences of octets used in Internet protocols... URLs are written only with the graphic printable characters of the US-ASCII coded character set. The octets 80-FF hexadecimal are not used in US-ASCII, and the octets 00-1F and 7F hexadecimal represent control characters; these must be encoded.
The encoding mentioned is the familiar %XX form that is also used to encode unsafe ASCII characters such as the space. This leaves open the question of which characters the octets above 127 (7F hex) represent.
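The ambiguity is easy to demonstrate (a Python sketch; the URL fragment is an invented example):

```python
# Sketch: the escaped octet %E9 is perfectly well defined as an octet,
# but the character it stands for depends entirely on an unstated charset.
from urllib.parse import unquote_to_bytes

octets = unquote_to_bytes("caf%E9")
assert octets == b"caf\xe9"

assert octets.decode("iso-8859-1") == "caf\u00e9"  # Latin-1: e with acute
assert octets.decode("iso-8859-7") == "caf\u03b9"  # Greek: small iota
```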
The bottom line is that only those whose language can be represented by ASCII (namely, English and Swahili) can have the benefit of meaningful and mnemonic names, a pretty unfair situation.
Technically, the situation is as follows: humans want to deal with characters, so that URLs can be printed on business cards, in magazines and newspapers, and typed on a keyboard. But software wants to deal with octets that can be unambiguously interpreted. There are not many solutions to that problem. One is to have universal agreement on the character to octet mapping; this is the current situation for 7-bit octets, but it leaves most of the world out in the cold. Another is to carry the character encoding information along with the URL; various proposals have been floated to that effect, but none have carried the day, mostly because of deployment and backward compatibility problems.
One proposal stands out among the crowd, at least in this author's opinion: universal agreement on the UTF-8 encoding of Unicode to represent characters in URLs. UTF-8 has the great redeeming virtue of preserving ASCII as is; in fact, it was designed for that. The effect is that all current legal URLs would remain valid, inasmuch as they contain only ASCII characters.
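A sketch of the proposed convention (Python; the words are invented examples): characters are first encoded as UTF-8, then octets outside the safe ASCII range are %XX-escaped, so existing ASCII-only URLs come through unchanged.

```python
from urllib.parse import quote

# Pure ASCII is its own UTF-8 encoding: nothing changes.
assert quote("paris", safe="") == "paris"

# Non-ASCII characters become %XX-escaped UTF-8 octets.
assert quote("caf\u00e9", safe="", encoding="utf-8") == "caf%C3%A9"
```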
This apparently presents a backward compatibility problem: what about current URLs containing non-ASCII characters (suitably encoded)? A little experiment now under way using a web crawling robot seems to indicate that there are very few of these, if any. The results at the moment are very, very preliminary, but more conclusive data will be presented at the workshop.
Universal agreement means that there is no need to carry along character encoding information, and Unicode means that everyone gets the benefit of using his own script. This does not yet carry over to the host name part of URLs, however, because that part depends on the DNS, which remains ASCII-only.
At least, HTTP is an 8-bit clean protocol! It transmits body octets transparently, and does not have a so-called ASCII mode like FTP that potentially damages non-ASCII text. Nevertheless, there are a few i18n issues in HTTP.
One is MIME compatibility when transmitting Unicode text without a transfer encoding, as HTTP allows: the canonical MIME form for text has lines ending in an ASCII CR-LF pair (octets 0D 0A), but in raw Unicode this comes out as 00 0D 00 0A. The HTTP working group has agreed to forgo perfect MIME compatibility in that case, but this decision has been challenged.
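The octet sequences involved are easy to check (a Python sketch; UTF-16 big-endian stands in here for raw two-octet Unicode):

```python
# MIME's canonical text form ends lines with the octets 0D 0A...
assert "\r\n".encode("ascii") == b"\x0d\x0a"

# ...but the same CR-LF pair in raw two-octet Unicode is four octets.
assert "\r\n".encode("utf-16-be") == b"\x00\x0d\x00\x0a"
```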
Another issue is the famous (or infamous) charset
parameter.
The problem is that the protocol does not make it compulsory, and that
implementations almost universally ignore it, leading to serious
interoperability problems. Even worse, there are still clients around that do
not interpret it properly, and fail to recognize perfectly common HTML files
when they are labeled with charset=ISO-8859-1
or charset=US-ASCII
.
There is a vicious circle at work here.
Content negotiation is important to internationalisation, but its standardization has recently been relegated to the next version of HTTP, pending the long-awaited adoption of version 1.0. At issue here is the usability of much-needed features like character encoding and language negotiation; the former is important for interoperability, allowing the server and client to negotiate a mutually agreeable character set; the latter is important for multilingual sites, permitting the transmission of the correct language version of a document.
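By way of illustration, minimal server-side language negotiation could look like the following sketch (the function and its quality-value parsing are hypothetical, not part of any standard API):

```python
# Hypothetical sketch: choose a language variant from an Accept-Language
# header such as "fr, de;q=0.8, en;q=0.5" (quality values default to 1.0).
def negotiate_language(accept_language, available):
    tags = []
    for part in accept_language.split(","):
        pieces = part.strip().split(";")
        q = 1.0
        for param in pieces[1:]:
            param = param.strip()
            if param.startswith("q="):
                q = float(param[2:])
        tags.append((q, pieces[0].strip()))
    # Try the client's preferences in decreasing order of quality.
    for _, tag in sorted(tags, key=lambda t: t[0], reverse=True):
        if tag in available:
            return tag
    return None

assert negotiate_language("de, en;q=0.5", {"en", "fr"}) == "en"
assert negotiate_language("fr", {"en", "de"}) is None
```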
Forms markup, submission and processing encompasses the three areas of HTML, HTTP and URLs, and so is treated separately. The main i18n-related problem with forms is once again the character encoding issue: text submitted from a form should be correctly tagged to ensure proper interpretation by the server. Furthermore, text should not be submitted in a character encoding that the server will not understand.
The first aspect is hard to address with the current form-submission
architecture, where text is encoded within an URL. The problem stems, of
course, from the lack of an agreed upon interpretation for non-ASCII characters
in URLs. When the POST
method is used to submit the form, the
URL-encoded data is sent as a body entity, and it would be possible to add a
charset
parameter to the Content-Type
header. This
practice, however, is neither standardized nor widespread, leaving the problem
standing.
RFC 1867 offers an even better solution, originally designed for file upload but permitting precise tagging of multiple body parts in a multipart MIME entity. Its existence as an RFC, as well as its usefulness for file upload, should encourage implementations.
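What such a submission might look like on the wire is sketched below (Python; the boundary, field name and value are invented, following the general shape of the RFC 1867 examples):

```python
# Sketch of a multipart/form-data body in which the text part carries its
# own explicit charset label, removing any guesswork on the server side.
boundary = "AaB03x"
body = (
    "--" + boundary + "\r\n"
    'Content-Disposition: form-data; name="comment"\r\n'
    "Content-Type: text/plain; charset=UTF-8\r\n"
    "\r\n"
    "caf\u00e9\r\n"
    "--" + boundary + "--\r\n"
).encode("utf-8")

assert b"charset=UTF-8" in body
assert b"caf\xc3\xa9" in body          # the UTF-8 octets of 'cafe' + acute
```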
The situation with GET
method submission can only be improved
by allowing submission in a body part (with proper tagging), or by a new URL
standard including character set identification or adoption of Unicode in some
form.
As for knowing what character encoding a server accepts, the HTML i18n draft
has introduced an Accept-Charset
attribute to the FORM
element to that effect; it simply lists the MIME charsets acceptable to the
server.
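Such a form might be marked up as in the following sketch (held in a Python string for uniformity with the other examples; the action URL and field name are invented, only the ACCEPT-CHARSET attribute comes from the draft):

```python
# Sketch: an HTML form advertising the charsets its server can decode.
form = (
    '<FORM ACTION="/cgi-bin/comment" METHOD="POST"\n'
    '      ACCEPT-CHARSET="UTF-8, ISO-8859-1">\n'
    '  <INPUT NAME="comment" TYPE="text">\n'
    "</FORM>\n"
)

# A client would pick, from the advertised list, a charset it can produce:
accepted = [c.strip() for c in "UTF-8, ISO-8859-1".split(",")]
assert accepted == ["UTF-8", "ISO-8859-1"]
```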
A multilingual, international and interoperable WWW holds many promises, but also raises a few rather large scale issues. In such a network application, clients may be faced with an overwhelming number of character sets to decipher, and with an overwhelming number of characters to display and process. And users may be faced with an overwhelming number of scripts and languages to read and understand.
Unicode can help a lot to alleviate the first problem, by reducing the number of necessary character encodings to, ultimately, one. However, this is for the long term, if ever; in the meantime, transcoding servers have been proposed to bridge the gap between servers and clients with limited character-encoding capabilities. In practice, a server unable to satisfy a client's charset requirements could redirect the request to such a transcoding server; the latter would act as a proxy for the client, but would change the character encoding to something the client understands.
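The core of such a transcoding proxy is a one-line body transformation, sketched here in Python (the function name is invented; a real proxy would also rewrite the charset parameter of the Content-Type header):

```python
# Sketch: re-encode a response body from the origin server's charset
# into one the client declared it can handle.
def transcode(body, from_charset, to_charset):
    # Characters with no equivalent in the target charset are replaced
    # rather than aborting the whole response.
    return body.decode(from_charset).encode(to_charset, errors="replace")

utf8_body = "caf\u00e9".encode("utf-8")
assert transcode(utf8_body, "utf-8", "iso-8859-1") == b"caf\xe9"
```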
As for display, the day when all clients can display all of Unicode is still far off. Thus glyph servers have been designed, again on the model of a proxy, and
to which a server could point in a redirect. The process is heavyweight, but
has the benefit of working today: the glyph server proxy retrieves a document on
behalf of a client, but instead of transmitting it transparently, parses the
HTML text and replaces each data character that the client cannot display with
an <IMG>
element pointing to a server-generated image of the
correct glyph. The client receives the modified HTML, retrieves the numerous images, and generates a fairly correct display, at the expense of tremendously increased network bandwidth, long latency, and the impossibility of adjusting font size and style or of further processing the text (such as simply searching for a word). Such servers exist. One example is the CIILIB library running over the DeleGate proxy server.
A better solution, of course, would be a universal browser; second best, an adaptable browser that could be upgraded on the fly upon encountering text in a new script.
The last frontier is relieving users of this multitude of scripts and languages. This can be approached by the devices of transliteration and machine translation. The former is much easier, and may be sufficient to highlight a few catch words or phrases, justifying further effort (such as obtaining a translation). The latter is more difficult, but the state of the art today is such that translations good enough to carry the general meaning of a text (as well as the occasional good laugh) can be obtained automatically. One can sample a Japanese-to-English translator in Japan, or pay a visit to the MAITS project home page.