19th International Unicode Conference, September 2001, San Jose, CA

Internationalized Resource Identifiers:
From Specification to Testing

Martin J. Dürst
W3C/Keio University
mailto:duerst@w3.org
http://www.w3.org/People/Dürst

Keywords: Uniform Resource Identifiers (URI), Internationalized Resource Identifiers (IRI), UTF-8

Abstract

Uniform Resource Identifiers (URIs) are a core component of the Web. Internationalized Resource Identifiers (IRIs) are equivalent to URIs except that they remove the limitation that only a subset of us-ascii can be used. Conversion between IRIs and URIs is based on the UTF-8 character encoding followed by %-escaping. This matches well with an increasing number of URI schemes and components that use UTF-8 as their encoding. This paper discusses URI internationalization in detail, including motivation, architecture, specifications, and testing.

1. Introduction

This section discusses the motivation for the internationalization of URIs and gives a basic introduction to URIs, their properties, and their components. Uniform Resource Identifiers (URIs) [RFC2396] are one of the three basic components of the original World Wide Web architecture (the other two being HTTP and HTML). URIs are the glue of the World Wide Web; they are used to identify virtually everything of importance, from Web pages and services to email addresses, telnet connections, and telephone calls.

Motivation for Internationalized Resource Identifiers

URIs typically mix readable parts with syntax that is cryptic, at least at first glance. For example, this paper will be available at http://www.w3.org/2001/Talks/0912-IUC-IRI/paper.html. In this example, the reader may be able to correlate several of the components with the date of the talk, the conference name, the title, and so on. Some of these correlations may be wrong, or may go unnoticed. But the experience with URIs over the last few years, as well as with many other kinds of identifiers, shows that there is a continuing desire for people to make use of such correlations. In particular, such correlations are useful for the following purposes [Dür97a]:

All these operations are much easier if people can use their native script. This is a very clear motivation for making sure URIs are appropriately internationalized. In addition, URIs may contain query parts, where it is important that characters can be sent to the server reliably (see Section 4).

Properties of URIs

The basic property of URIs is that they are identifiers, i.e. they stand for something else. The 'something else' is called the resource, and the process of obtaining the resource is called resolution. URIs have a number of additional important properties. The properties most important in the context of this discussion are uniformity and transcribability. For completeness, this subsection also briefly discusses universality and the distinction between URLs and URNs.

Uniformity refers to the fact that certain syntactic conventions are associated with certain operations for all URIs. As an example, the characters '#' and '/' always have the same function whenever they appear in an URI. This does not mean that every URI has a '/', or that all URI schemes allow '/', but it guarantees that the operations associated with '/' characters in URIs can be executed uniformly for all URIs. A thorough discussion of the importance of uniformity for the current and future operation of the Web can be found in [Gettys].

Uniformity was in some cases used as an argument against URI internationalization. Using a small and uniform set of characters would allow any URI to be used by anybody, on almost any type of device. However, many URIs are predominantly used by people who know a particular script, and it is much better to optimize these URIs for these users than to optimize them for the remaining small minority that is not familiar with the script.

While the above discussion applies to the final form of URIs, uniformity is definitely very important when looking at character encoding issues. As Section 2 will show, this unfortunately has not been recognized from the start.

The second important property of URIs in the context of internationalization is that they are not only used inside digital systems as protocol elements, but also appear on paper and in people's minds. These kinds of transcriptions are important for URI internationalization in various ways. As noted above, they are one of the main motivations for internationalizing URIs to the point where a wide range of characters can be used directly.

URIs are also known as Universal Resource Identifiers. This refers to the fact that anything of importance can be given an URI, and any existing system of identifiers can be subsumed by URIs. It does not mean that URIs are the only such system possible, but it is currently the most visible, successful, and widely used one.

URIs are often partitioned into URLs (Uniform Resource Locators) and URNs (Uniform Resource Names). Internationalization of URIs is orthogonal to this distinction, and so only a very short summary is given here. The distinction is made in at least three ways, depending on the context of the discussion. First, in an abstract sense, there is an attempt to distinguish between names and addresses (locations). This works very well for physical entities such as human beings or books in a library, but gets heavily blurred in the case of a digital network with numerous indirections and caching mechanisms. Second, in an intentional sense, URNs are often positioned for more persistent use. Third, in a syntactical sense, URNs are distinguished as those URIs that start with the prefix (scheme name) urn:.

URI Components

URI syntax is defined so that various parts of an URI can be clearly identified if present. First, according to [RFC2396], what goes into the href attribute of the <a> element in HTML and similar places is called an URI Reference. This includes the part after the #, the so-called fragment identifier. For [RFC2396] and specifications referring to it, only the part before the # (if present) is actually called URI. In everyday language, the term URI is often used for everything including the fragment identifier; this paper follows this practice because internationalization considerations apply to the fragment identifier without exception.

In a well-defined context (e.g. in a Web page that has its own URI), it is possible to use relative URIs, which can be extremely short. [RFC2396] defines exactly how relative URIs can be converted into absolute URIs. Again, the distinction between relative and absolute URIs is not relevant for internationalization. Below, we will use very short examples, which can be understood to be relative URIs.
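As a small illustration (not part of [RFC2396] itself), the following Python sketch resolves two made-up relative references against a made-up base URI using urllib.parse, whose resolution rules give the same result as [RFC2396] for simple cases like these:

    # Relative URI resolution; base and relative references are made-up examples.
    from urllib.parse import urljoin

    base = "http://www.example.org/2001/Talks/paper.html"
    print(urljoin(base, "slides.html"))   # http://www.example.org/2001/Talks/slides.html
    print(urljoin(base, "../Notes/"))     # http://www.example.org/2001/Notes/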

The first part of an absolute URI, up to the first colon, is the scheme. Well-known schemes include http:, ftp:, and so on. The scheme defines both the syntax (within the general limits of the URI syntax) and the semantics of the URIs in this scheme, including character encoding.

Overview

The next section discusses character encoding in URIs, from the legacy of undefined character encoding towards the consistent use of UTF-8. Section 3 introduces IRIs as the internationalized equivalent of URIs. Section 4 deals with specific aspects such as query part internationalization, domain name internationalization, and bidirectionality. Section 5 discusses testing and future work.

2. From Legacy to Consistency

This section discusses the evolution from legacy URI character handling to the use of UTF-8 for consistent URI character handling. For completeness, some other approaches to URI internationalization that have been proposed in the past are also discussed.

Legacy URI Character Handling

Older specifications for URIs [RFC1630] do not clearly distinguish between characters and bytes, and to some extent assume the use of iso-8859-1. With the quick growth of the Web beyond the area covered by iso-8859-1, this assumption became obsolete.

[RFC2396], the specification currently defining URIs, explains how characters get encoded into URIs in Section 2.1. A sequence of original characters (e.g. in a domain name or a file name) is mapped to a sequence of bytes. This sequence of bytes is then mapped to a sequence of URI characters. Both mappings can work both ways.

The second mapping, from bytes to URI characters, is well defined. For the bytes corresponding to a subset of us-ascii, the us-ascii encoding is used. This subset includes all letters and digits and a small number of symbols (called unreserved in [RFC2396]: '-', '_', '.', '!', '~', '*', "'", '(', ')'). For all the other bytes, a '%' followed by two hexadecimal digits is used. This is called %-encoding or %HH-encoding. The escaping also affects all the syntactically relevant characters such as '/', '#', '%', and so on. As an example, a simple % has to be escaped to %25 to clearly distinguish a 'payload' % character from a % used in a %-escape. It is also possible to escape additional bytes. As an example, an 'a' can always be escaped to %61, although this is done extremely rarely.
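For illustration only, this second mapping can be sketched in a few lines of Python; the function name is ours, and the sketch conservatively escapes every byte outside the unreserved set (a real implementation would leave reserved characters such as '/' unescaped where they are used syntactically):

    # Sketch of the byte-to-URI-character mapping: unreserved bytes become
    # us-ascii characters, all other bytes become %HH escapes.
    UNRESERVED = (b"ABCDEFGHIJKLMNOPQRSTUVWXYZ"
                  b"abcdefghijklmnopqrstuvwxyz"
                  b"0123456789-_.!~*'()")

    def bytes_to_uri_chars(data: bytes) -> str:
        return "".join(chr(b) if b in UNRESERVED else "%%%02X" % b for b in data)

    print(bytes_to_uri_chars(b"M\xe4rz"))              # M%E4rz
    print(bytes_to_uri_chars("März".encode("utf-8")))  # M%C3%A4rz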

Unfortunately, the first mapping, from original characters to bytes, is not well defined. For characters in the us-ascii range, the us-ascii encoding is used, but for other characters, [RFC2396] explicitly leaves the encoding undefined, and defers it to future specifications. The resulting situation is depicted in Table 1.

  original characters <== encoding undefined ==> bytes <== us-ascii or %HH ==> URI characters

  original characters   encoding     bytes            URI characters
  March                 us-ascii     4D 61 72 63 68   March
  März                  iso-8859-1   4D E4 72 7A      M%E4rz
  März                  macintosh    4D 8A 72 7A      M%8Arz
  März                  utf-8        4D C3 A4 72 7A   M%C3%A4rz

Table 1: Mapping between original characters and URI characters, with examples.

This shows clearly that there is a very strong asymmetry between the characters in the us-ascii range and other characters. For characters in the us-ascii range, the overall mapping is the identity. From protocol designers to end users, nobody is really aware that there are two mappings; the identity is taken for granted. For other characters, there is a double handicap: they get converted to unreadable escapes, and the encoding used gets lost. When trying to back-convert M%E4rz, for example, one could end up with M‰rz (reading the byte as macintosh) or März (reading it as iso-8859-1), with no way to tell which was intended.
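The rows of Table 1 can be reproduced for illustration with Python's urllib.parse.quote, which %HH-escapes all bytes outside a small safe set, much like the mapping sketched above ('macintosh' is the codec name Python uses for the Macintosh Roman encoding):

    from urllib.parse import quote

    for enc in ("iso-8859-1", "macintosh", "utf-8"):
        print(enc, quote("März".encode(enc)))
    # iso-8859-1 M%E4rz
    # macintosh M%8Arz
    # utf-8 M%C3%A4rz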

Please note that there is no requirement that URIs are constructed from original characters. In the case of digital data, it is also possible to start directly with bytes. However, the only known URI scheme that allows binary data to be encoded directly, the data: scheme [RFC2397], uses base64 for better readability and shorter length.

Character Encoding in URIs based on UTF-8

The above situation can be improved in two steps. The first step consists in converging on a single encoding. The second step consists in extending the number of characters allowed in URIs. The first step is described here. The second step is described in Section 3. Both steps are designed to be introduced in parallel and to reinforce each other.

Allowing arbitrary encodings for non-ASCII characters in URIs creates unnecessary confusion. Converging on a single encoding is highly desirable. UTF-8 is the encoding of choice for the following main reasons:

[RFC2396] leaves details of URI syntax, including the issue of character encoding, to scheme-specific definitions. Some of these in turn leave the choice of encoding to the individual creators of URIs. Given this situation, it was not possible to suddenly declare that UTF-8 should be the only encoding used in all URIs. [RFC2718], Section 2.2.5, however, clearly recommends using UTF-8 for new URI schemes.

There are the following ways in which an URI scheme can adopt UTF-8:

There are also parts of URIs that are independent of URI schemes, in particular the fragment identifier (the part after the #). Fragment identifiers are separated from URIs before resolution, and applied to the resolved resource depending on its MIME type. The syntax of fragment identifiers is defined by the format used for the resource, e.g. HTML. A syntax for more flexible fragment identifiers is [XPointer], which is also defined to use UTF-8.

Other Proposed Solutions

To address the problems with URI internationalization, other approaches were also considered initially. However, they were discarded years ago because each of them had severe problems. We discuss them here mainly to help explain why the UTF-8 solution was adopted.

Some proposals tried to extend the tagging approach used for MIME (e.g. email) headers and body parts, i.e. to indicate the encoding used for each URI or URI component. This was quickly rejected for a large number of reasons. Adding tags would lengthen the URI considerably. There would be confusion about whether the tagging indicated the encoding at the current place, or the encoding to be used when converting from characters to protocol bytes. Encoding tags would have to be added to URIs on paper, which would be highly counterintuitive. Also, URI resolvers and other software would have to know all encodings used.

Another proposal was to use (a variant of) UTF-7 instead of UTF-8. This would have had a slight advantage in length compared to the escaped UTF-8 form. However, it would be difficult or impossible to keep original us-ascii characters apart from the characters produced by UTF-7.

A convention that at some point enjoyed some popularity, and is actually produced or accepted by a few implementations, was to use %uHHHH, where HHHH is the four-digit hexadecimal number of the code unit in Unicode (for example, 'ä' would be written as %u00E4). The advantage is that %uHHHH is clearly an encoding of a character based on Unicode. The problem is that it is new syntax that easily confuses older implementations.

3. Internationalized Resource Identifiers

Converging on UTF-8 for the conversion between original characters and URIs is a very important step ahead, but still requires %HH-escaping. The obvious goal is to get rid of %HH-escaping whenever possible, and to just reach the same 'identity conversion' as for us-ascii. For this, the convergence to UTF-8 is an important prerequisite, because otherwise conversion to (traditional) URIs is not clearly defined.

The resulting construct has been called Internationalized URI and Globalized URI, but recently, we have adopted the term Internationalized Resource Identifier (IRI). The change of terminology made it quite a bit easier to describe the concepts, because it was possible to avoid lengthy terms such as non-internationalized URI. However, dropping the 'U' (uniform or universal) does not at all mean that these principles have been dropped; IRIs maintain these principles, and in some sense are actually more uniform and universal. It also does not mean that IRIs should be limited to very special places. IRIs can and should replace URIs wherever possible. The use of two clearly distinct terms makes it easier to describe this replacement in specifications. Whether the general public will ever adopt the term IRI is a different question.

In principle, the definition of IRI is very easy: It is the same as an URI, except that wherever %HH is allowed, non-URI characters are also allowed. As a result, using non-ASCII characters in IRIs becomes as easy and straightforward as using us-ascii characters in URIs. For convenience, the resolution of IRIs is defined via a conversion to URIs. However, this does not mean that an actual conversion to URIs is always needed. Conversion from IRIs to URIs is straightforward. All the characters not allowed in URIs are %HH-escaped after a conversion to bytes based on UTF-8. This is shown in Table 2.

  original characters <== encoding ==> bytes <== utf-8 or %HH ==> IRI
                                        bytes <== us-ascii or %HH ==> URI

  original characters   encoding     bytes            IRI      URI
  March                 utf-8        4D 61 72 63 68   March    March
  März                  utf-8        4D C3 A4 72 7A   März     M%C3%A4rz
  März                  iso-8859-1   4D E4 72 7A      M%E4rz   M%E4rz
  März                  macintosh    4D 8A 72 7A      M%8Arz   M%8Arz

Table 2: Original characters, IRIs, and URIs.

Table 2 also shows that IRIs do not exclude URIs based on legacy encodings (last two rows). However, because these URIs do not use UTF-8, %HH-escaping has to be used. There are other cases where %HH-escaping has to be used or can be used; altogether, there are the following cases:

IRI Specification Details

While the basic idea of IRIs is extremely simple, there are some details that have to be addressed carefully. These include digital transport of IRIs, normalization issues, and conversion from URIs to IRIs.

Of course IRIs are designed to be transported digitally. One important detail which may be easy to overlook in this case is that IRIs, in the same way as URIs, are sequences of characters, not sequences of bytes. This is obvious when IRIs appear on paper or other physical transport media. For digital formats and protocols, it means that the character encoding of IRIs follows the character encoding conventions of the format or protocol in question. To take a particular example, an email body part or Web page encoded in iso-8859-1 would use the E4 byte to encode the character 'ä' in the IRI März, in the same way it uses this byte for all the other 'ä' characters. Conversion to UTF-8 only occurs when converting the IRI to an URI or when directly passing the IRI to the resolution API (assuming that API uses UTF-8 and not e.g. UTF-16).
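For illustration, the following Python sketch shows this point: the byte representation of an IRI follows the encoding of the surrounding document, and UTF-8 only comes into play when the IRI is converted to a URI (the IRI below is a made-up example):

    from urllib.parse import quote

    iri = "http://www.example.org/März"

    in_latin1_page = iri.encode("iso-8859-1")   # 'ä' stored as byte E4
    in_utf8_page   = iri.encode("utf-8")        # 'ä' stored as bytes C3 A4

    # Both pages carry the same IRI once decoded according to their own encoding:
    assert in_latin1_page.decode("iso-8859-1") == in_utf8_page.decode("utf-8") == iri

    # Only the conversion to a URI is fixed to UTF-8 followed by %HH-escaping:
    print(quote(iri.encode("utf-8"), safe=":/"))
    # http://www.example.org/M%C3%A4rz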

Unicode allows multiple encoding variants for certain characters. For example, many characters with diacritics can be represented both in precomposed and in decomposed form. On paper and similar transport media, there is no difference. When converting from a form that does not make such distinctions to a form where these distinctions are relevant, a particular encoding must be chosen consistently. For many reasons outlined in [NormReq], the form to choose is Normalization Form C (NFC) [UTF15].
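A small Python illustration of this point, using the unicodedata module (the example word is the same as in the tables above):

    import unicodedata

    precomposed = "M\u00e4rz"      # U+00E4 LATIN SMALL LETTER A WITH DIAERESIS
    decomposed  = "Ma\u0308rz"     # 'a' followed by U+0308 COMBINING DIAERESIS

    assert precomposed != decomposed                                # different code point sequences
    assert unicodedata.normalize("NFC", decomposed) == precomposed  # NFC yields the precomposed form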

While conversion from IRIs to URIs is quite straightforward, the reverse conversion is more difficult. The main problem is that it is not clear whether some %HH-escaped sequence was the result of encoding characters using UTF-8 or whether it was produced otherwise. However, UTF-8 byte sequences are highly regular and very rarely coincide with byte sequences from legacy encodings (see [Dür97]). Even so, just testing against UTF-8 byte patterns is not sufficient. Before converting back, various other conditions have to be checked. These include non-allocated code points, non-normalized code sequences, and characters not suitable in IRIs, such as formatting characters and various kinds of spaces and compatibility characters.
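A rough Python sketch of the first of these checks: runs of %HH escapes are back-converted only if the resulting bytes are valid UTF-8. A real converter would additionally check for unallocated code points, non-NFC sequences, and characters unsuitable in IRIs, as described above (the function names are ours):

    import re

    def uri_to_iri(uri: str) -> str:
        def unescape(match: re.Match) -> str:
            raw = bytes(int(h, 16) for h in re.findall("%([0-9A-Fa-f]{2})", match.group(0)))
            try:
                return raw.decode("utf-8")
            except UnicodeDecodeError:
                return match.group(0)        # not UTF-8: leave the escapes as they are
        return re.sub("(?:%[0-9A-Fa-f]{2})+", unescape, uri)

    print(uri_to_iri("M%C3%A4rz"))   # März   (valid UTF-8, converted back)
    print(uri_to_iri("M%E4rz"))      # M%E4rz (not UTF-8, left untouched)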

Specifications using IRIs

Over the last few years, a number of core Web specifications have adopted what is now called an IRI. The most important ones are:

As [IRI] is still only a draft, these specifications include explicit definitions of IRI behavior. The following quote shows how this is done in [XLink]; other specifications contain very similar provisions.

Some characters are disallowed in URI references, even if they are allowed in XML; the disallowed characters include all non-ASCII characters, plus the excluded characters listed in Section 2.4 of [IETF RFC 2396], except for the number sign (#) and percent sign (%) and the square bracket characters re-allowed in [IETF RFC 2732]. Disallowed characters must be escaped as follows:

  1. Each disallowed character is converted to UTF-8 [IETF RFC 2279] as one or more bytes.
  2. Any bytes corresponding to a disallowed character are escaped with the URI escaping mechanism (that is, converted to %HH, where HH is the hexadecimal notation of the byte value).
  3. The original character is replaced by the resulting character sequence.
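As a minimal, purely illustrative Python sketch of the three quoted steps, applied here only to non-ASCII characters (the excluded us-ascii characters mentioned in the quote are left alone for brevity; the function names are ours):

    def escape_char(ch: str) -> str:
        utf8_bytes = ch.encode("utf-8")                    # step 1: convert to UTF-8
        return "".join("%%%02X" % b for b in utf8_bytes)   # step 2: one %HH escape per byte

    def iri_to_uri(iri: str) -> str:
        # step 3: each disallowed (here: non-ASCII) character is replaced by its escapes
        return "".join(ch if ord(ch) < 0x80 else escape_char(ch) for ch in iri)

    print(iri_to_uri("http://www.example.org/März"))
    # http://www.example.org/M%C3%A4rz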

One remaining problem is the treatment of those us-ascii characters that are not allowed in URIs, i.e. space and various delimiters such as '<', '>', '{', '}',... The current wording in [IRI] excludes them, but most of the specifications above actually allow them. It is expected that [IRI] will be updated to allow them, but will contain a strong warning against using them. The reasons for allowing them are that many actual implementations already tolerate them, and that in some contexts, in particular in XML attributes, many of them do not cause any problems, while other characters such as '&' actually have to be escaped (not with URI %HH-escaping, but with XML escaping such as &amp;). The reason for a strong warning is that IRIs that contain these characters cannot easily be transferred from one context to another.

Conditions for Using IRIs

There are two groups of conditions for the use of IRIs. The first group comprises basic operational conditions to deal with non-ASCII characters on the computer, and includes keyboard and display logic. The second group contains the two conditions necessary for IRIs to actually work in a particular context:

4. Specific Aspects of URI Internationalization

Forms and Query Parts

HTML Forms are an important part of interactivity and client-server communication on the Web. The most frequent way of sending information back to the server after filling in a form is by appending it to the request URI after a '?', in the so-called query part.

Internationalization of query parts is very important, because form character data naturally contains non-ascii characters. Unfortunately, because character encoding of URIs was not originally specified, various heuristics developed:

Sending out a page in UTF-8 alleviates many problems with Web internationalization, including the problem of query parts. [XForms], the new generation of forms currently under development at W3C, provides hope for a final solution to all the above problems.
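For illustration, the following Python sketch shows the query part produced for a form on a page served in UTF-8 (the field name and value are made-up examples):

    from urllib.parse import urlencode

    query = urlencode({"city": "Zürich"}, encoding="utf-8")
    print("http://www.example.org/search?" + query)
    # http://www.example.org/search?city=Z%C3%BCrich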

Domain Names

Many kinds of URIs can contain domain names (for example www.example.com). Currently, domain names only allow a very restricted set of characters, a subset of the characters allowed in URIs. The IETF IDN Working Group [IDN] is working on a solution to allow a large number of characters from Unicode in domain names. The solution that currently has the most support for use inside the domain name system itself is a mapping of Unicode characters back to ASCII characters, a so-called ACE (ASCII Compatible Encoding).

Independent of the solution chosen for the domain name system itself, uniformity of character representation is important for domain name components in URIs and IRIs, too. [IDN-URI] therefore proposes to adopt UTF-8 followed by %HH escaping for the domain name parts of URIs, which means that in IRIs, domain names from all kinds of scripts can be used naturally. [IDN-URI] also extends the generic URI syntax of [RFC2396] to allow %HH-escapes in the domain name part.
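As a small, purely illustrative sketch of the convention proposed in [IDN-URI], a non-ASCII host name in an IRI would be converted with UTF-8 followed by %HH-escaping, just like other URI components (Python; the host name is a made-up example, and this does not show the separate ACE encoding used inside the domain name system itself):

    from urllib.parse import quote

    host_in_iri = "www.zürich.example"
    print(quote(host_in_iri.encode("utf-8")))
    # www.z%C3%BCrich.example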

Bidirectionality

Scripts such as Arabic and Hebrew are written from right to left. Combined with letters from other scripts or with digits, this leads to the problem of mixed directionality or bidirectionality. Bidirectionality for URIs and IRIs is a difficult problem due to three contradictory requirements:

[Atkin] proposes a reordering algorithm for domain names that strikes a good balance between limitations on character combinations (e.g. both Latin and Arabic in the same component) and complexity. [IRI-bidi] explains how to use bidi control characters to bridge the gap between an IRI-specific algorithm and the Unicode algorithm. A combination of ideas from both proposals may lead to the best solution under the given boundary conditions.

5. Testing and Future Work

The previous sections have shown how to move towards a consistent and user-friendly architecture for internationalized URIs. IRIs allow a wide range of characters to be used directly with the same syntax as URIs. Conversion to URIs is done by encoding in UTF-8 and then using %HH-escaping as necessary. This fits well with the adoption of UTF-8 by more and more URI schemes.

Experience with various W3C specifications over the last few years has shown that test suites can be a very efficient way to improve specifications and their implementation and deployment. We are therefore currently working on tests for IRIs. A first version is publicly available at http://www.w3.org/2001/08/iri-test. Tests will include documents conforming to different specifications (HTML, XML, XML Schema,...) in different encodings. Each test will check the functionality of a particular IRI in the document. Some tests will be added to make sure that the basic functionality needed for the tests is available and that the tests are executed correctly. These will include tests using us-ascii only as well as basic tests for each of the encodings used. An earlier version of the tests, available only to W3C members, already showed some encouraging results.

Other future work in particular includes moving the various specifications currently in draft stage further along the W3C or IETF specification process.

Acknowledgements

All opinions and errors in this paper are purely those of the author. There are many people who deserve acknowledgement for URI internationalization, too many to list them all. The main thanks go to François Yergeau, who had the idea both for using UTF-8 and for how to address bidi problems, and to Larry Masinter, for providing both help and creative pushback.

References

RFCs are available from many locations, among others http://www.ietf.org/rfc/rfcNNNN.txt, where NNNN is the RFC number. Internet-Drafts are work in progress and are frequently updated. They can be found at http://www.ietf.org/internet-drafts/xxxx, where xxxx is the name of the draft. Please check whether there is a newer version, with a higher sequence number (e.g. -08.txt in place of -07.txt).

[Atkin]
Steve Atkin, Bidirectionality and Domain Names, Proc. 19th International Unicode Conference, San Jose, CA, Sept. 2001.
[CharMod]
Martin J. Dürst, François Yergeau, Asmus Freytag, and Tex Texin, Character Model for the World Wide Web 1.0, W3C Working Draft, January 2001, newest version available from http://www.w3.org/TR/charmod.
[Dür97]
Martin J. Dürst, The Properties and Promises of UTF-8, Proc. 11th International Unicode Conference, San Jose, CA, Sept. 1997, also available as http://www.ifi.unizh.ch/mml/mduerst/papers/PDF/IUC11-UTF-8.pdf.
[Dür97a]
Martin J. Dürst, Internationalizing Internet Identifiers, Proc. 11th International Unicode Conference, San Jose, CA, Sept. 1997, also available as http://www.ifi.unizh.ch/mml/mduerst/papers/PDF/IUC11-URI.pdf.
[Gettys]
Jim Gettys, URI Model Consequences, at http://www.w3.org/DesignIssues/ModelConsequences.
[HTML4]
Dave Raggett, Arnaud Le Hors, and Ian Jacobs, HTML 4.01 Specification, W3C Recommendation, December 1999, available at http://www.w3.org/TR/html401.
[IDN]
IETF Internationalized Domain Name (idn) Working Group. Information at http://www.ietf.org/html.charters/idn-charter.html.
[IDN-URI]
Martin Dürst, Internationalized Domain Names in URIs and IRIs, Internet-Draft draft-ietf-idn-uri-00.txt, January 2001.
[IRI]
Larry Masinter and Martin Dürst, Internationalized Resource Identifiers (IRI), Internet-Draft draft-masinter-url-i18n-07.txt, January 2001.
[IRI-bidi]
Martin Dürst, Internet Identifiers and Bidirectionality, Internet-Draft draft-duerst-iri-bidi-00.txt, July 2001.
[NormReq]
Martin Dürst, Requirements for String Identity Matching and String Indexing, W3C Working Draft, July 1998, available at http://www.w3.org/TR/WD-charreq.
[RFC1630]
Tim Berners-Lee, Universal Resource Identifiers in WWW, RFC 1630, June 1994.
[RFC2141]
R. Moats, URN Syntax, RFC 2141, May 1997.
[RFC2192]
C. Newman, IMAP URL Scheme, RFC 2192, September 1997.
[RFC2277]
H. Alvestrand, IETF Policy on Character Sets and Languages, RFC 2277, BCP 18, January 1998.
[RFC2396]
T. Berners-Lee, R. Fielding, and L. Masinter, Uniform Resource Identifiers (URI): Generic Syntax, RFC 2396, August 1998.
[RFC2397]
L. Masinter, The "data" URL scheme, RFC 2397, August 1998.
[RFC2640]
B. Curtin, Internationalization of the File Transfer Protocol, RFC 2640, July 1999.
[RFC2718]
L. Masinter, H. Alvestrand, D. Zigmond, and R. Petke, Guidelines for new URL Schemes, RFC 2718, November 1999.
[UTF15]
Mark Davis and Martin Dürst, Unicode Standard Annex #15: Unicode Normalization Forms, March 2001, available at http://www.unicode.org/unicode/reports/tr15.
[XForms]
Micah Dubinko et al., Eds., XForms 1.0, W3C Working Draft, June 2001, newest version available at http://www.w3.org/TR/xforms.
[XLink]
Steve DeRose, Eve Maler, and David Orchard, Eds., XML Linking Language (XLink) Version 1.0, W3C Recommendation, June 2001, available at http://www.w3.org/TR/xlink.
[XML]
Tim Bray, Jean Paoli, C. M. Sperberg-McQueen, and Eve Maler, Eds., Extensible Markup Language (XML) 1.0 (Second Edition), W3C Recommendation October 2000 (First edition February 1998; errata at http://www.w3.org/XML/xml-V10-2e-errata), available at http://www.w3.org/TR/REC-xml.
[XMLSchema]
Paul V. Biron and Ashok Malhotra, XML Schema Part 2: Datatypes, W3C Recommendation, May 2001, available at http://www.w3.org/TR/xmlschema-2.
[XPointer]
Steve DeRose, Eve Maler, and Ron Daniel Jr., Eds., XML Pointer Language (XPointer) Version 1.0, W3C Last Call Working Draft, January 2001, newest version available from http://www.w3.org/TR/xptr.