W3C International Internationalization

This page is no longer maintained and may be inaccurate. For more up-to-date information, see the Internationalization Activity home page.

For an introduction to IRIs vs. Domain Names, see An Introduction to Multilingual Web Addresses.

Internationalized Resource Identifiers (IRIs)

Internationalized Resource Identifiers (IRIs) are a new protocol element, a complement to URIs [RFC2396]. An IRI is a sequence of characters from the Universal Character Set (Unicode/ISO10646). There is a mapping from IRIs to URIs, which means that IRIs can be used instead of URIs where appropriate to identify resources.

The Internationalization Working Group is preparing to submit draft-duerst-iri-0x.txt, by Martin Dürst and Michel Suignard, to the IESG for Proposed Standard. This document is discussed on public-iri@w3.org (public archive).

Background

There is a philosophical problem with all identifiers, one that is particularly obvious in an international context. Identifiers are used not only by computers, but also by humans. Humans have a strong preference for mnemonic names: you call the address field in a form "address" and not "x12". Such identifiers are easier to create, easier to remember, easier to understand, easier to guess, easier to transcribe, and easier to identify with. The prime example is URIs: they are machine-readable, but you also find them printed in advertisements and user manuals, and their creators make every attempt to make them easy to remember.

Should identifiers therefore always use English-like words, in ASCII letters only? This may make things easier for some people, but harder for others. Is it worth making things easier for the majority of actual users, while making them a bit more difficult for occasional outsiders? And what if identifiers have to contain information that is not originally in ASCII? URIs created as the result of a form submission are a prime example.

URIs

Internationalization of URIs is important because URIs may contain all kinds of information from all kinds of protocols or formats that use characters beyond ASCII. The URI syntax defined in RFC 2396 currently allows only a subset of ASCII, about 60 characters. It also defines a way to encode arbitrary bytes into URI characters: a % followed by two hexadecimal digits (%HH-escaping). However, for historical reasons, it does not define how arbitrary characters are encoded into bytes before %HH-escaping is applied.
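The gap described above can be made concrete with a small Python sketch. It escapes the same character twice, once via ISO-8859-1 and once via UTF-8, and gets two different %HH sequences. The helper name percent_escape and the choice of the two encodings are assumptions made for this illustration only; they do not come from RFC 2396.

```python
from urllib.parse import quote


def percent_escape(text: str, encoding: str) -> str:
    """Encode text to bytes in the given encoding, then %HH-escape every byte
    that is not an unreserved ASCII character (hypothetical helper)."""
    return quote(text.encode(encoding), safe="")


# The same character, e with acute accent (U+00E9), in two encodings.
latin1 = percent_escape("\u00e9", "iso-8859-1")
utf8 = percent_escape("\u00e9", "utf-8")

print(latin1)  # %E9
print(utf8)    # %C3%A9
```

A receiver that sees only %E9 or %C3%A9 cannot tell which character was meant unless the character encoding is agreed on in advance.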

Among the various solutions discussed a few years ago, the use of UTF-8 as the preferred character encoding for URIs was judged best. This is in line with the IRI-to-URI conversion, which encodes characters as UTF-8 and then escapes the resulting bytes with %HH.
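The conversion can be sketched in a few lines of Python. This is a minimal illustration, not the full procedure from the IRI draft: it assumes the input is already a well-formed, normalized IRI, it leaves every ASCII character (including existing %HH escapes) untouched, and it applies no special treatment to the domain name part. The function name iri_to_uri and the example address are hypothetical.

```python
def iri_to_uri(iri: str) -> str:
    """Map an IRI to a URI by encoding each non-ASCII character as UTF-8
    and %HH-escaping the resulting bytes (simplified sketch)."""
    out = []
    for ch in iri:
        if ord(ch) < 0x80:
            # ASCII characters are copied through unchanged.
            out.append(ch)
        else:
            # Non-ASCII characters become one %HH triplet per UTF-8 byte.
            out.extend(f"%{byte:02X}" for byte in ch.encode("utf-8"))
    return "".join(out)


uri = iri_to_uri("http://example.org/r\u00e9sum\u00e9")
print(uri)  # http://example.org/r%C3%A9sum%C3%A9
```

Because ASCII characters pass through unchanged, a string that is already a URI maps to itself.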

Various document formats already use IRIs:

Additional reading:

Domain Names

The IETF WG on Internationalized Domain Names and its mailing list archive. Some earlier work: A proof-of-concept proposal (expired) to internationalize domain names without changing the current DNS. There was also a proposal to use UTF-8 in DNS (draft-skwan-utf8-dns-03.txt, expired). Some experiments with international domain names.

Additional Information

History

An old list of links including pointers to some email discussions (Martin J. Dürst, February 1997).

Where the idea to use UTF-8 in URIs was born: François Yergeau, Internationalization of URLs, September 1996.


Martin Dürst
Webmaster
Last updated $Date: 2011/03/07 12:04:44 $