i18n/l10n: URL

Internationalization / Localization

This page is no longer maintained and may be inaccurate. For more up-to-date information, see the Internationalization Activity home page.

URL

URLs typically look like this:

scheme://domain.name/path?query

Example:

http://search.w3.org/htdig/member?words=tim

URLs are currently restricted to some 60 characters (a subset of ASCII), with the rest of ASCII available via escape sequences (a "%" followed by two hexadecimal digits). URLs can also contain 8-bit characters via the same escape sequences, but there is no way to tell which character set those bytes belong to. The syntax is formally defined in RFC 1738.
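The ambiguity of escaped 8-bit characters can be illustrated with a minimal Python sketch. The word "café" and the two character sets chosen here are assumptions made for illustration only; they do not come from RFC 1738.

```python
from urllib.parse import quote, unquote_to_bytes

name = "café"  # hypothetical resource name, chosen for illustration

# The same name yields different escape sequences depending on the
# character set the server happened to use.
latin1_url = quote(name.encode("latin-1"))  # one byte for the accented letter
utf8_url = quote(name.encode("utf-8"))      # two bytes for the accented letter

# A client that receives only the escaped form gets raw bytes back.
raw = unquote_to_bytes(latin1_url)

# Nothing in the URL says how to interpret those bytes, so two
# equally plausible guesses give two different names.
as_latin1 = raw.decode("latin-1")
as_cyrillic = raw.decode("iso8859_5")

print(latin1_url, utf8_url)
print(as_latin1, as_cyrillic)
```

Both decodings succeed without error, which is exactly the problem: the client cannot tell which one the server intended.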

Some people have suggested that it would be better if most of the (printable) Unicode characters were allowed. An argument against this is that a URL containing more than just ASCII would be very difficult to print or to read out over the phone, at least when communicating it to somebody with a different mother tongue.

Currently, servers that serve resources with non-ASCII names encode those names using only the 60 or so allowed characters. Unfortunately, there is no way for the server to indicate the encoding method used. Martin Dürst has proposed a specific encoding method, based on UTF-8. Once this method is used widely enough, browsers could start assuming it and display the URL in decoded form. Some code (in Java and Perl) is available that shows how it works.
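As a rough sketch of the idea (not the Java or Perl code mentioned above, and not necessarily the exact rules of the proposal), a name can be converted to UTF-8 and each resulting byte escaped; a browser that assumes UTF-8 can then try to reverse the process and fall back to the escaped form when the bytes are not valid UTF-8. The function names `encode_name` and `display_form` are hypothetical.

```python
from urllib.parse import quote, unquote_to_bytes

def encode_name(name: str) -> str:
    """Escape a resource name as UTF-8 bytes, using only ASCII characters."""
    # safe="" means even "/" is escaped; this treats the input as a
    # single path segment, which is an assumption of this sketch.
    return quote(name, safe="", encoding="utf-8")

def display_form(escaped: str) -> str:
    """Return a readable form if the escaped bytes are valid UTF-8."""
    raw = unquote_to_bytes(escaped)
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Not UTF-8: the encoding is unknown, so show the URL unchanged.
        return escaped

encoded = encode_name("Dürst")
print(encoded)                  # D%C3%BCrst
print(display_form(encoded))    # Dürst
print(display_form("caf%E9"))   # caf%E9 (a lone 0xE9 byte is not valid UTF-8)
```

The fallback works because byte sequences produced by other 8-bit encodings are often invalid UTF-8. This is a heuristic: some non-UTF-8 byte sequences happen to be valid UTF-8, so the guess is only safe once the convention is widely followed. The sketch also decodes every escape, including escaped reserved characters such as "%2F", which a real browser would need to leave alone.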

A URL can include a query part, which the client adds to the URL when it sends a request that includes the results of an HTML form. In contrast to the rest of the URL, the query part has to be in a form that is meaningful to both the client and the server. Restricting it to ASCII means that the GET method cannot be used for any form that allows entry of free text. The POST method is an alternative, but some browsers have problems with putting POSTed URLs in history or bookmark lists.
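The need for client and server to agree on the query encoding can be sketched as follows. The field name `words` is taken from the example URL above; the non-ASCII value and the choice of UTF-8 and Latin-1 are assumptions made for illustration.

```python
from urllib.parse import urlencode, parse_qs

# A plain ASCII form value produces the query from the example URL.
ascii_query = urlencode({"words": "tim"})

# Hypothetical client: encodes free text from the form as UTF-8.
query = urlencode({"words": "Dürst"}, encoding="utf-8")

# Hypothetical server A assumes UTF-8 and recovers the text.
seen_by_a = parse_qs(query, encoding="utf-8")["words"][0]

# Hypothetical server B assumes Latin-1 and gets garbled text,
# without any error being raised.
seen_by_b = parse_qs(query, encoding="latin-1")["words"][0]

print(ascii_query)
print(query)
print(seen_by_a, seen_by_b)
```

Only the server that guesses the client's encoding correctly sees the text the user typed; the query string itself carries no indication of which guess is right.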

François Yergeau has written an article about URL issues.

Bert Bos, i18n coordinator
Webmaster
Last updated $Date: 1998/01/28 09:08:48 $