Internationalized Resource Identifiers (IRIs)

This page is no longer maintained and may be inaccurate. For more up-to-date information, see the Internationalization Activity home page.

For an introduction to IRIs vs. Domain Names, see An Introduction to Multilingual Web Addresses.

Internationalized Resource Identifiers (IRIs) are a new protocol element, a complement to URIs [RFC2396]. An IRI is a sequence of characters from the Universal Character Set (Unicode/ISO10646). There is a mapping from IRIs to URIs, which means that IRIs can be used instead of URIs where appropriate to identify resources.

The Internationalization Working Group is preparing to submit draft-duerst-iri-0x.txt, by Martin Dürst and Michel Suignard, to the IESG for publication as a Proposed Standard. This document is discussed on public-iri@w3.org (public archive).

Background

There is a philosophical problem with all identifiers, and it is particularly obvious in an international context. Identifiers are used not only by computers, but also by humans. Humans have a strong preference for mnemonic names: you call the address field in a form "address" and not "x12". Such identifiers are easier to create, remember, understand, guess, transcribe, and identify with. The prime example is URIs: they are machine-readable, but you also find them printed in advertisements and user manuals, and their creators make every attempt to make them easy to remember.

Should identifiers therefore always use English-like words and ASCII letters only? This may make things easier for some people, but harder for others. Is it worth making things easier for the majority of actual users, while making them a bit more difficult for occasional outsiders? And what if identifiers have to contain information that is not originally in ASCII? URIs created as the result of a form submission are a prime example.

URIs

Internationalization of URIs is important because URIs may contain all kinds of information from all kinds of protocols or formats that use characters beyond ASCII. The URI syntax defined in RFC 2396 currently allows only a subset of ASCII, about 60 characters. It also defines a way to encode arbitrary bytes into URI characters: a % followed by two hexadecimal digits (%HH-escaping). However, for historical reasons, it does not define how arbitrary characters are encoded into bytes before %HH-escaping is applied.
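
To illustrate the gap described above, here is a minimal Python sketch of %HH-escaping applied to bytes. The helper name percent_escape is hypothetical and not taken from RFC 2396. The sketch shows that the same character produces different escaped forms depending on which character encoding is chosen first, which is exactly what the URI syntax leaves undefined.

```python
def percent_escape(data: bytes) -> str:
    """Escape every byte as % followed by two uppercase hex digits.

    Hypothetical helper for illustration only: a real URI producer would
    leave the characters permitted by the URI syntax unescaped.
    """
    return "".join(f"%{b:02X}" for b in data)


ch = "\u00e9"  # LATIN SMALL LETTER E WITH ACUTE

# The character-to-byte step is not fixed by the URI syntax,
# so two producers may disagree on the escaped form.
latin1_form = percent_escape(ch.encode("iso-8859-1"))
utf8_form = percent_escape(ch.encode("utf-8"))

print(latin1_form)  # %E9
print(utf8_form)    # %C3%A9
```

A receiver that sees only %E9 or %C3%A9 cannot tell which encoding was used unless a convention says so.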

Among the various solutions discussed a few years ago, using UTF-8 as the preferred character encoding for URIs was judged best. This is in line with the IRI-to-URI conversion, which encodes characters as UTF-8 and then escapes the resulting bytes with %HH:

- Guidelines for new URL Schemes, RFC 2718, proposes to base URIs on UTF-8 unless there is some compelling reason for a particular scheme to do otherwise.
- URI schemes or components already based on UTF-8:
  - URN syntax: RFC 2141 (syntactically, URNs look like a URI scheme, but semantically, they are not)
  - IMAP: RFC 2192 (the IMAP protocol uses a modified version of UTF-7, but its URIs use UTF-8)
  - FTP: RFC 2640 (uses UTF-8, but tolerates legacy encodings)
  - XPointer (W3C Working Draft)
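
The IRI-to-URI conversion mentioned above (encode as UTF-8, then %HH-escape) can be sketched in Python as follows. This is a simplified illustration, not the procedure from the draft: the function name iri_to_uri and the example IRIs are assumptions, and the sketch ignores any special treatment of the domain name part or of normalization.

```python
def iri_to_uri(iri: str) -> str:
    """Map an IRI to a URI by %HH-escaping the UTF-8 bytes of non-ASCII characters.

    Simplified sketch: ASCII characters (including existing %HH escapes)
    are copied through unchanged; everything else is encoded as UTF-8
    and each resulting byte is escaped.
    """
    parts = []
    for ch in iri:
        if ord(ch) < 0x80:
            parts.append(ch)  # ASCII: keep as-is
        else:
            parts.extend(f"%{b:02X}" for b in ch.encode("utf-8"))
    return "".join(parts)


# Hypothetical example IRI (example.org is a reserved example domain).
uri = iri_to_uri("http://example.org/r\u00e9sum\u00e9")
print(uri)  # http://example.org/r%C3%A9sum%C3%A9
```

Because only non-ASCII characters are touched, an identifier that is already a URI passes through this mapping unchanged.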

Various document formats already use IRIs:

- In XML 1.0, system identifiers are IRIs (see also erratum E26).
- In XLink, the href attribute is an IRI.
- XML Schema provides the anyURI datatype for IRIs.
- HTML 4.0, Appendix B.2.1: Non-ASCII characters in URI attribute values (this is supported by all major browsers in newer versions).
- The W3C Working Draft Character Model for the World Wide Web proposes to use IRIs in W3C formats in general.

Additional reading:

- An overview including motivation and examples: Martin J. Dürst, Internationalized Resource Identifiers: From Specification to Testing, 19th International Unicode Conference, San Jose, CA, Sept. 2001.
- A paper discussing UTF-8 and server-side heuristics for transition (PDF/PS): Martin J. Dürst, The Properties and Promises of UTF-8, 11th International Unicode Conference, San Jose, CA, Sept. 1997.

Domain Names

See the IETF WG on Internationalized Domain Names and its mailing list archive. Some earlier work: a proof-of-concept proposal (expired) to internationalize domain names without changing the current DNS, and a proposal to use UTF-8 in DNS (draft-skwan-utf8-dns-03.txt, expired). There have also been some experiments with international domain names.

Additional Information

History

An old list of links including pointers to some email discussions (Martin J. Dürst, February 1997).

Where the idea to use UTF-8 in URIs was born: François Yergeau, Internationalization of URLs, September 1996.

Martin Dürst
Webmaster
Last updated 2011/03/07 12:04:44