23rd Internationalization and Unicode Conference, March 2003, Prague, Czech Republic
Martin J. Dürst
An up-to-date version of this paper and the corresponding talk are available at http://www.w3.org/2003/Talks/0324WWL.
The traditional localization model was very much geared towards isolated computations on stand-alone computers. Today's networked world raises new problems, but also provides new opportunities for innovative solutions. Because cultural conventions have been developed over a long time, they cannot easily be abstracted in a uniform way. The talk will discuss various ways to address the problem, such as identifiers, data formats, and dynamic data exchange (e.g. using Web services), and how they can be structured to work together.
The World Wide Web has affected computing and communication in very fundamental ways. Applications have been moving from a closed world to an open world, and this shift affects localization just as fundamentally. This paper discusses the various aspects of localization on, with, and for the World Wide Web.
The traditional locale model, most prominently present in [Posix], makes a large number of assumptions and restrictions that cannot be upheld in an internetworked environment. It is important to understand these restrictions: there is one locale per user, application, or computer; a locale is responsible for a fixed number of things, and these are strongly linked together; it is assumed that all the data and functionality necessary for a particular locale is available locally; and creating and installing locales is a task requiring considerable system expertise and privileges. Some of these restrictions have been relaxed in newer versions of the traditional locale model, but usually not without considerable pain for implementers.
Most if not all of these assumptions have to be given up when moving to the open world of the World Wide Web. This leads to quite a bit more complexity and uncertainty. However, being in an open world not only has disadvantages, it can also help. Web technology has had to solve several general problems whose solutions can very beneficially be applied to localization. One typical example is URIs (Uniform/Universal Resource Identifiers, [URI], see also [IRI]). They make it possible to identify and address information of any kind (not only Web pages!) over the network. Another excellent example is [XML], the Web's generic format for structured documents and data.
The openness of the Web raises many different questions. Where are some operations, in particular localization operations, done? Who decides on preferences, and how are they communicated? What kind of fallbacks are available in case some part of the network is temporarily inaccessible? Who has the expertise for appropriate localization, and how can it be transferred to where it is needed? Can the Web help to make localization better and more flexible, and make localized applications available to more languages and cultures?
Work on localization over the years has developed various general principles. For the World Wide Web, some new principles may need to be added.
This is a principle that is extremely well established, and is often
worded as 'separate data and presentation'. On the World Wide Web, and with
XML, which is primarily text-based, a completely presentation-independent,
e.g. binary, representation was not feasible. However, [XMLSchema] has established representations that are a
good compromise between readability and abstractness. With the use of
Western-Arabic numerals and the Gregorian calendar, the actual
representations are somewhat geared towards the Western hemisphere. However,
in most cases, the underlying data that is being represented (e.g. a number,
a date) is culturally independent. The exceptions are the datatypes whose
names begin with g, such as gDay. This datatype represents e.g. the 5th
day of each month in the Gregorian calendar, a concept that cannot easily be
represented in other calendars. These datatypes should therefore be avoided in
data that has to cross cultural boundaries.
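To make the distinction concrete, the following Python sketch (the mapping into Python's datetime is the author's illustration, not part of the schema specification) shows how an xs:date value is calendar-neutral even though its lexical form is Gregorian, while a gDay value has no such neutral counterpart:

```python
from datetime import date

# The lexical form of an XML Schema xs:date uses Gregorian notation,
# but the value it denotes (a specific day on the timeline) is
# culture-independent and can be converted to any calendar.
d = date.fromisoformat("2003-03-24")
assert (d.year, d.month, d.day) == (2003, 3, 24)

# A gDay value such as "---05" means "the 5th of each Gregorian month".
# Python's datetime has no counterpart for it, which illustrates why
# such recurring-Gregorian datatypes travel badly across calendars.
```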
This principle is similar to the last one. Localization close to the user can help to make sure that the necessary resources are available, and that the right preferences are chosen. Transmitting user preferences, in general and in the specific case of localization-related preferences, is a problem that is not yet very well understood. Proposals for solutions range from the simple use of language identifiers (see [RFC3066]) to structured formats such as [CC/PP].
One goal of localization is to get as close as possible to the end user to make their interaction with the computer as seamless as possible. This means, for example for date formatting, that we want to make sure the user sees dates formatted in exactly the way she has been used to. However, on the World Wide Web, we are not completely sure who will see our data.
It is therefore important to build in the necessary redundancy. A typical
example is date formats. In a well-defined context, a date such as 02/03/04 can
be clearly interpreted. But in a global context, it is highly confusing. Is
this February 3rd, 2004, or March 2nd, 2004, or March
4th, 2002? This means that on the network, we had better build in some
redundancy, and use a more explicit date format.
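The ambiguity can be demonstrated directly; the three parsing patterns below are the author's illustration of the three readings mentioned above:

```python
from datetime import datetime

ambiguous = "02/03/04"

# Three equally plausible readings of the same short date string:
readings = {
    "US (month/day/year)":       datetime.strptime(ambiguous, "%m/%d/%y").date(),
    "European (day/month/year)": datetime.strptime(ambiguous, "%d/%m/%y").date(),
    "ISO-like (year/month/day)": datetime.strptime(ambiguous, "%y/%m/%d").date(),
}

for label, d in readings.items():
    print(f"{label}: {d.isoformat()}")
# US       -> 2004-02-03
# European -> 2004-03-02
# ISO-like -> 2002-03-04
```

Writing the date out in the explicit ISO 8601 form in the first place removes the ambiguity entirely.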
This example is also a typical example of how technology influences culture. While we try to avoid this, it is not always possible or advisable.
The purpose of these examples is to show that the various aspects once lumped together as a locale have very different properties when moving to an open network, but that the World Wide Web can provide unique solutions in each case.
Let's start with a rather easy example, namely character encoding. Why easy? For decades, character encoding was one of the biggest obstacles to interoperability between computers. Character encoding was also a major component of a locale definition in the traditional model. But thanks to Unicode and XML, it has been largely solved. XML allows documents to indicate the character encoding they are using, in a very simple and straightforward way. Also, every XML processor is required to process UTF-8 and UTF-16. Across XML technology, Unicode is used as the common reference; the motto is "Think in Unicode".
Although it is still possible to transmit data in non-Unicode (legacy) encodings, in most cases it is much easier and more straightforward to use a Unicode-based encoding. This keeps the character encoding aspect of the document or data independent from any locale considerations. It also avoids or reduces the remaining inaccuracies in definitions of character encodings (see e.g. [JapXML]).
Sorting, collation, and related operations such as database queries with comparison operators are the typical example of functionality that is difficult to move close to the client. The reason for this is that it usually has to deal with a large if not huge amount of data, and requires serious processing power. Downloading a whole data set to a client just to extract a few items in a culturally adequate way is not feasible from a resource viewpoint, and may also raise serious privacy issues. Even on the server side, on-the-fly sorting according to various cultural preferences is often very slow or infeasible, and the only choice may be the sorting order for which an index was built up.
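Why collation cannot be a plain byte or code-point comparison can be shown with a toy tailoring; the "German phone-book" rule below (ä sorts like ae, etc.) is a drastically simplified stand-in for what real implementations such as the Unicode Collation Algorithm provide:

```python
# A toy sort key implementing one well-known German tailoring rule.
# Real collation involves multi-level weights, context, and much more;
# this only illustrates that the ordering is a cultural convention.
def german_phonebook_key(s: str) -> str:
    return (s.lower()
             .replace("ä", "ae").replace("ö", "oe")
             .replace("ü", "ue").replace("ß", "ss"))

names = ["Müller", "Mueller", "Muller"]
print(sorted(names))                            # naive code-point order
print(sorted(names, key=german_phonebook_key))  # tailored order
```

Moving such a key function (or the index built with it) to where the bulk data lives is exactly the server-side placement problem described above.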
Many problems in internationalization and localization require a staggered approach. Because cultural conventions have been developed over a long time, they cannot easily be abstracted in a uniform way. Any kind of simplification, however well thought out, may work for a large number of cases, but there will always be cases that need special treatment. Date formatting is a very typical example.
The easiest solution to date formatting is to just display a standard [ISO8601] date, such as 2003-03-24. The
author does not know of any locale where this format is used as such. But the
percentage of people that will be able to decode it correctly with only
moderate effort is large, the chances that somebody misinterprets it are
small, and the effort needed is minimal.
Of course, just displaying a standard format is not what localization is
all about. A better solution is based on the observation that a large part of
the world is using the Gregorian calendar, in particular for business use.
Localized formatting can be achieved by appropriate reordering of the year,
month, and day parts, conversion of numbers to month names, and changing the
separator texts. This leads to formats such as
March 24, 2003.
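This reordering-and-renaming step can be sketched as pattern-driven formatting. The pattern strings and month names below are hand-written stand-ins for what a format like LDML would supply; they are not taken from any real locale data repository:

```python
from datetime import date

# Illustrative locale data: a pattern plus month names per locale.
PATTERNS = {
    "en-US": ("{month_name} {day}, {year}",
              ["January", "February", "March", "April", "May", "June",
               "July", "August", "September", "October", "November",
               "December"]),
    "de-DE": ("{day}. {month_name} {year}",
              ["Januar", "Februar", "März", "April", "Mai", "Juni",
               "Juli", "August", "September", "Oktober", "November",
               "Dezember"]),
}

def format_date(d: date, locale_id: str) -> str:
    pattern, months = PATTERNS[locale_id]
    return pattern.format(month_name=months[d.month - 1],
                          day=d.day, year=d.year)

print(format_date(date(2003, 3, 24), "en-US"))  # March 24, 2003
print(format_date(date(2003, 3, 24), "de-DE"))  # 24. März 2003
```

Note that this approach only works as long as the target calendar is the Gregorian one; for other calendars, data alone is not enough, as discussed below.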
It is quite straightforward to define a format containing all the necessary
information for a particular task such as date formatting (see e.g. [LDML]), as long as the format is based on the Gregorian
calendar. A particular format can then be identified by a URI to the
relevant data. In some cases, it is also feasible to provide the data needed
for formatting inline, see e.g. the corresponding
element in [XSLT]. Unfortunately, in that specific case,
the format cannot be identified directly with a URI.
However, there are various calendars in use around the world for which a
simple list of month names, etc., is not enough. These formats require actual
calculations of various degrees of complexity. In many cases, the code needed
will be installed locally. But what can be done if this is not the case? Data
travels much more easily over the network than recipes for calculation, because
there is no universal programming language, and the execution of downloaded
programs has various security problems. However, the World Wide Web allows
calculations to be done remotely. For example, it would not be too difficult
to create a script on a server that returned a date formatted in the Hebrew
calendar when accessed with a suitable URI, e.g. with the date passed in a
query parameter. Again, the format is identified by a URI. In general, the concept of
accessing remote functionality and exchanging messages from computer to
computer over the World Wide Web is called Web Services [WebServ].
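The server side of such a formatting service might look like the following sketch. The URI layout (a ?date= query parameter) and the formattedDate response element are assumptions for the sake of the example, and a trivial rearrangement stands in for real Hebrew-calendar arithmetic, which is omitted:

```python
from urllib.parse import urlparse, parse_qs

def handle_request(uri: str) -> str:
    # Extract the locale-neutral date from the query part of the URI.
    query = parse_qs(urlparse(uri).query)
    year, month, day = query["date"][0].split("-")
    # Stand-in "formatting"; a real service would convert the date into
    # the target calendar before producing localized text.
    return (f'<formattedDate calendar="example">'
            f"{day}/{month}/{year}</formattedDate>")

print(handle_request("http://example.org/format/hebrew?date=2002-03-24"))
```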
It is important to realize that the above two approaches, obtaining the
data defining the format and requesting remote formatting, can easily be
integrated. We can identify any format by a URI and can access it with a
query part of ?date=2002-03-24 attached. If we get e.g. an LDML
document, we know we have to do the actual formatting ourselves. If we get
the actual formatted date, we are already done. To make sure we can identify
formatted dates, we can define a very simple XML format for formatted dates,
e.g. a single element whose content is the formatted date itself.
It is also important that we do appropriate caching, i.e. that we remember data-based definitions and actually formatted dates rather than retrieving the same information over and over via the network. Due to the locality (in a double sense) of processing, this will increase efficiency dramatically. Of course, this functionality should be abstracted so that applications can just obtain a formatted date from a URI and a locale-neutral date.
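The integrated, cached operation can be sketched as follows. The fetch stub, its canned responses, and the formattedDate wrapper element are all invented for illustration; a real implementation would issue an HTTP request and parse the reply properly:

```python
from functools import lru_cache

def fetch(uri: str) -> str:
    # Stub standing in for an HTTP GET; the two canned responses below
    # illustrate the two possible kinds of reply.
    if "remote" in uri:
        return "<formattedDate>24.3.2003</formattedDate>"
    return "<ldml>...</ldml>"  # format-definition data instead

@lru_cache(maxsize=None)  # remember results instead of re-fetching
def formatted_date(format_uri: str, iso_date: str) -> str:
    response = fetch(f"{format_uri}?date={iso_date}")
    if response.startswith("<formattedDate>"):
        # The service already did the formatting for us.
        return response[len("<formattedDate>"):-len("</formattedDate>")]
    # Otherwise we received format data and must format locally
    # (the local formatting step is elided in this sketch).
    return iso_date
```

The application sees only the abstraction at the bottom: a URI plus a locale-neutral date in, a formatted date out.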
Ignoring all the implementation details, we can describe the formatting operation as having a URI and a date as input and a formatted date as output.
There is one more aspect of date formatting, namely that certain calendars are not completely algorithmic. There is no way, for example, to predict what the name of the next era in the Japanese imperial calendar might be, or when it might start. As another example, the start of months in the Islamic calendar is often determined by actual observation of the moon. On a single computer in a closed world, this aspect of calendars is impossible to handle. The network can offer a solution via protocols similar to [NTP].
Some details that are not discussed in this section but that also need to be addressed are the details of date correspondence (e.g. the Gregorian day starts at midnight, whereas the day in the Hebrew calendar starts at sunset), and the long term stability of algorithms and of the Gregorian calendar itself.
Monetary formatting is probably the most egregious example of a component in the old localization model that has to be treated completely differently on the World Wide Web. The closed model assumed that a single user/application/installation would only use a single currency. Knowledge about the currency being dealt with was therefore implicitly built into the formatting operation rather than being explicit in the data. Even for purely local installations, the Euro conversion showed how bad such implicit assumptions were.
On the World Wide Web, monetary amounts and currencies can be handled in a more appropriate way. First, we can choose not to convert, making sure that the user sees the correct amount at which e.g. a product is offered. This raises the question of cross-locale monetary formatting. Rather than choosing the format used in the culture of the original currency or the format customary in the user's locale (with the currency symbol substituted), it is important to use an unambiguous designation. This designation can either come from ISO 4217 (e.g. USD, CHF) or be another customary designation (e.g. US$, SFr.).
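The first option, an unambiguous designation without conversion, can be sketched as follows; the display function and its layout are the author's illustration (note that even the digit grouping shown here is itself a Western convention):

```python
from decimal import Decimal

def display_amount(amount: Decimal, iso_code: str) -> str:
    # Pair the amount with an explicit ISO 4217 code rather than a bare
    # symbol whose meaning depends on the reader's locale ("$" alone
    # could be US, Canadian, Australian dollars, and more).
    return f"{iso_code} {amount:,.2f}"

print(display_amount(Decimal("1234.50"), "USD"))  # USD 1,234.50
print(display_amount(Decimal("99.00"), "CHF"))    # CHF 99.00
```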
The second level is to provide an estimated conversion of the amount into the user's preferred currency, clearly marking this as an estimate. An alternative that is often used is to provide a link to a conversion service.
With the improved integration of services on the World Wide Web, it should be possible in the future to offer real-time guaranteed conversions, i.e. to offer the user an option to be billed in her preferred currency at a conversion rate that is guaranteed by a third party. In other words, as soon as the purchase is made, the third party exchanges the amount from the user's currency to the seller's currency.
Translation of actual text, traditionally called message translation, is in many ways different from the examples looked at above. The range of texts needed is open-ended, and translation takes a lot of time and money. There are many arguments for doing this both on the server and on the client side.
Let's start with the client side. The closer to the client the conversion to actual text (including other related formatting operations) happens, the easier it is to adapt to the user's language and cultural conventions. It has a higher chance of providing the language of the user's choice, which may not be available on the server. However, while doing localization on the client side is the goal, it may not always be feasible. Very lightweight clients may not be able to carry large message catalogs. Abstracting and transmitting various kinds of error conditions from failures of the underlying infrastructure may not easily be possible. And doing translation on the client side may mean that the same text is translated repeatedly and independently, which is a waste of resources.
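Client-side message lookup with language fallback, in the spirit of [RFC3066] language tags, can be sketched as follows; the catalog contents and message identifier are invented for illustration:

```python
# Minimal message catalogs keyed by language; a real client would load
# these from installed or downloaded resources.
CATALOGS = {
    "de": {"file_not_found": "Datei nicht gefunden"},
    "en": {"file_not_found": "File not found"},
}

def translate(message_id: str, language_tag: str) -> str:
    # Try the full tag (e.g. "de-CH"), then its primary subtag ("de"),
    # then fall back to English.
    for lang in (language_tag, language_tag.split("-")[0], "en"):
        if message_id in CATALOGS.get(lang, {}):
            return CATALOGS[lang][message_id]
    return message_id  # last resort: show the identifier itself

print(translate("file_not_found", "de-CH"))  # Datei nicht gefunden
```

The fallback chain is exactly where the tension described above appears: a lightweight client may only be able to carry the topmost catalogs in the chain.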
As an aside, it is also worth mentioning that Web technology is helping to streamline the actual translation process (see e.g. [TMX] or [XLIFF]).
This paper has tried to show how the change from local computing to networked applications and the World Wide Web fundamentally affects the problem space and the potential solutions for localization. The World Wide Web, with URIs, XML, and Web Services, will provide the building blocks for new, flexible, and innovative solutions. This is an active and exciting area of work where you can participate and contribute, e.g. in the Web Services Task Force of the Internationalization Working Group at W3C [WSTF].
All opinions and errors in this paper are purely those of the author. I am grateful to many of my colleagues, and in particular to the members of the Web Services Task Force of the Internationalization Working Group at W3C [WSTF], for inspiration and discussions.