19th International Unicode Conference, September 2001, San Jose, CA
Martin J. Dürst
W3C/Keio University
mailto:duerst@w3.org
http://www.w3.org/People/Dürst
Keywords: Uniform Resource Identifiers (URI), Internationalized Resource Identifiers (IRI), UTF-8
Uniform Resource Identifiers (URIs) are a core component of the Web. Internationalized Resource Identifiers (IRIs) are equivalent to URIs except that they remove the limitation that only a subset of us-ascii can be used. Conversion between IRIs and URIs is based on the UTF-8 character encoding followed by %-escaping. This matches well with an increasing number of URI schemes and components that use UTF-8 as their encoding. This paper discusses URI internationalization in detail, including motivation, architecture, specifications, and testing.
This section discusses the motivation for the internationalization of URIs and gives a basic introduction to URIs, their properties, and their components. Uniform Resource Identifiers (URIs) [RFC2396] are one of the three basic components of the original World Wide Web architecture (the other two being HTTP and HTML). URIs are the glue of the World Wide Web; they are used to identify virtually everything of importance, from Web pages and services to email addresses, telnet connections, and telephone calls.
Typically, URIs use a mixture of readable parts and syntax that is cryptic, at least at first glance. For example, this paper will be available at http://www.w3.org/2001/Talks/0912-IUC-IRI/paper.html. In this example, the reader may be able to correlate several of the components with the date of the talk, the conference name, the title, and so on. Some of these correlations may be wrong, or may go unnoticed. But the experience with URIs over the last few years, as well as with many other kinds of identifiers, shows that there is a continuing desire for people to make use of such correlations. In particular, such correlations are useful for the following purposes [Dür97a]:
All these operations are much easier if people can use their native script. This is a very clear motivation for making sure URIs are appropriately internationalized. In addition, URIs may contain query parts, where it is important that characters can be sent to the server reliably (see Section 4).
The basic property of URIs is that they are identifiers, i.e. they stand for something else. The 'something else' is called the resource, and the process of obtaining the resource is called resolution. URIs have a number of additional important properties. The properties most important in the context of this discussion are uniformity and transcribability. For completeness, this subsection also shortly discusses universality and the distinctions between URLs and URNs.
Uniformity refers to the fact that certain syntactic conventions are associated with certain operations for all URIs. As an example, the characters '#' or '/' always have the same function whenever they appear in an URI. This does not mean that every URI has a '/', or that all URI schemes allow '/', but it guarantees that the operations associated with '/' characters in URIs can be executed uniformly for all URIs. A thorough discussion of the importance of uniformity for the current and future operation of the Web can be found in [Gettys].
Uniformity was in some cases used as an argument against URI internationalization. Using a small and uniform set of characters would allow any URI to be used by anybody, on almost any type of device. However, many URIs are predominantly used by people knowing a particular script, and it is much better to optimize these URIs for these users than to optimize them for the remaining small minority that is not familiar with the script.
While the above discussion applies to the final form of URIs, uniformity is definitely very important when looking at character encoding issues. As Section 2 will show, this unfortunately has not been recognized from the start.
The second important property of URIs in the context of internationalization is that they are not only used inside digital systems as protocol elements, but also on paper and in people's minds. These kinds of transcriptions are important for URI internationalization in various ways. As noted above, they are one of the main motivations for internationalizing URIs to the point where a wide range of characters can be used directly.
URIs are also known as Universal Resource Identifiers. This refers to the fact that anything of importance can be given an URI, and any existing system of identifiers can be subsumed by URIs. It does not mean that URIs are the only such system possible, but it is currently the most visible, successful, and widely used one.
URIs are often partitioned into URLs (Uniform Resource Locators) and URNs (Uniform Resource Names). Internationalization of URIs is orthogonal to this distinction, and so only a very short summary is given here. Depending on the context of the discussion, the distinction is made in at least three ways. First, in an abstract sense, there is an attempt to distinguish between names and addresses (locations). This works very well for physical entities such as human beings or books in a library, but gets heavily blurred in the case of a digital network with numerous indirections and caching mechanisms. Second, in an intentional sense, URNs are often positioned for more persistent use. Third, in a syntactical sense, URNs are distinguished as those URIs that start with the prefix (scheme name) urn:.
URI syntax is defined so that various parts of an URI can be clearly identified if present. First, according to [RFC2396], what goes into the href attribute of the <a> element in HTML and similar places is called an URI Reference. This includes the part after the #, the so-called fragment identifier. For [RFC2396] and specifications referring to it, only the part before the # (if present) is actually called an URI. In everyday language, the term URI is often used for everything including the fragment identifier; this paper follows this practice, because internationalization considerations apply to the fragment identifier without exception.
In a well-defined context (e.g. in a Web page that has its own URI), it is possible to use relative URIs, which can be extremely short. [RFC2396] defines exactly how relative URIs can be converted into absolute URIs. Again, the distinction between relative and absolute URIs is not relevant for internationalization. Below, we will use very short examples, which can be understood to be relative URIs.
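The relative reference resolution defined in [RFC2396] is widely implemented; as an illustrative sketch, Python's standard library exposes it as urljoin (the file name slides.html below is a made-up example):

```python
from urllib.parse import urljoin

base = 'http://www.w3.org/2001/Talks/0912-IUC-IRI/paper.html'

# A relative URI is interpreted against the base URI of the document;
# the last path segment of the base is replaced:
print(urljoin(base, 'slides.html'))
# -> http://www.w3.org/2001/Talks/0912-IUC-IRI/slides.html
```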
The first part of an absolute URI, up to the first colon, is the scheme. Well-known schemes include http:, ftp:, and so on. The scheme defines both the syntax (within the general limits of the URI syntax) and the semantics of the URIs in this scheme, including character encoding.
The next section discusses character encoding in URIs, from the legacy of undefined character encoding towards the consistent use of UTF-8. Section 3 introduces IRIs as the internationalized equivalent of URIs. Section 4 deals with specific aspects such as query part internationalization, domain name internationalization, and bidirectionality. Section 5 discusses testing and future work.
This section discusses the evolution from legacy URI character handling to the use of UTF-8 for consistent URI character handling. For completeness, some other approaches to URI internationalization that have been proposed in the past are also discussed.
Older specifications for URIs [RFC1630] do not clearly distinguish between characters and bytes, and to some extent assume the use of iso-8859-1. With the quick growth of the Web beyond the area covered by iso-8859-1, this assumption became obsolete.
[RFC2396], the specification currently defining URIs, explains how characters get encoded into URIs in Section 2.1. A sequence of original characters (e.g. in a domain name or a file name) is mapped to a sequence of bytes. This sequence of bytes is then mapped to a sequence of URI characters. Both mappings can work both ways.
The second mapping, from bytes to URI characters, is well defined. For the bytes corresponding to a subset of us-ascii, the us-ascii encoding is used. This subset includes all letters and digits and a small number of symbols (called unreserved in [RFC2396]: '-', '_', '.', '!', '~', '*', "'", '(', ')'). For all the other bytes, a '%' followed by two hexadecimal digits is used. This is called %-encoding or %HH-encoding. The escaping also affects all the syntactically relevant characters such as '/', '#', '%', and so on. As an example, a simple % has to be escaped to %25 to clearly distinguish a 'payload' % character from a % used in a %-escape. It is also possible to escape additional bytes. As an example, an 'a' can always be escaped to %61, although this is done extremely rarely.
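This second mapping can be sketched with Python's standard library. By default, quote() already leaves letters, digits, '-', '_', '.', and '~' unescaped; passing the remaining [RFC2396] marks as safe reproduces the unreserved set described above:

```python
from urllib.parse import quote

# RFC 2396 marks that Python's quote() would otherwise escape:
MARKS = "!*'()"

def hh_escape(data: bytes) -> str:
    """Map a byte sequence to URI characters: unreserved bytes
    stay us-ascii, every other byte becomes %HH."""
    return quote(data, safe=MARKS)

print(hh_escape(b'March'))    # March   -- plain us-ascii passes through
print(hh_escape(b'M\xe4rz'))  # M%E4rz  -- the 0xE4 byte becomes an escape
print(hh_escape(b'100%'))     # 100%25  -- a payload '%' must be escaped
```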
Unfortunately, the first mapping, from original characters to bytes, is not well defined. For characters in the us-ascii range, the us-ascii encoding is used, but for other characters, [RFC2396] explicitly leaves the encoding undefined, and defers it to future specifications. The resulting situation is depicted in Table 1.
Both mappings work in both directions: the first mapping (original characters to bytes) has no defined encoding, while the second (bytes to URI characters) uses us-ascii or %HH.

| original characters | encoding | bytes | URI characters |
|---------------------|------------|----------------|----------------|
| March | us-ascii | 4D 61 72 63 68 | March |
| März | iso-8859-1 | 4D E4 72 7A | M%E4rz |
| März | macintosh | 4D 8A 72 7A | M%8Arz |
| März | utf-8 | 4D C3 A4 72 7A | M%C3%A4rz |

Table 1: Mapping between original characters and URI characters, with examples.
This shows clearly that there is a very strong asymmetry between the characters in the US-ASCII range and other characters. For characters in the US-ASCII range, the overall mapping is the identity. From protocol designers to end users, nobody is really aware that there are two mappings; the identity is taken for granted. For other characters, there is a double handicap: they get converted to unreadable escapes, and the encoding used gets lost. When trying to back-convert, one could for example end up with M‰rz or März.
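This encoding loss can be reproduced with Python's standard library: the same escaped URI decodes to different characters depending on which legacy encoding the recipient guesses.

```python
from urllib.parse import unquote

# The escaped URI 'M%E4rz' carries no hint of its original encoding:
print(unquote('M%E4rz', encoding='iso-8859-1'))  # März  (correct guess)
print(unquote('M%E4rz', encoding='mac_roman'))   # M‰rz  (wrong guess)
```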
Please note that there is no requirement that URIs are constructed from original characters. It is also possible to start directly with bytes in the case of digital data. However, the only known URI scheme that allows direct encoding of (binary) data, the data: scheme [RFC2397], uses base64 for better readability and shortness.
The above situation can be improved in two steps. The first step consists in converging on a single encoding. The second step consists in extending the number of characters allowed in URIs. The first step is described here. The second step is described in Section 3. Both steps are designed to be introduced in parallel and to reinforce each other.
Allowing arbitrary encodings for non-ASCII characters in URIs creates unnecessary confusion. Converging on a single encoding is highly desirable. UTF-8 is the encoding of choice for the following main reasons:
[RFC2396] leaves details of URI syntax, including the issue of character encoding, to scheme-specific definitions. Some of these in turn leave the choice of encoding to the individual creators of URIs. Given this situation, it was not possible to suddenly declare that UTF-8 should be the only encoding used in all URIs. [RFC2718], Section 2.2.5, however clearly recommends using UTF-8 for new URI schemes.
There are the following ways in which an URI scheme can adopt UTF-8:
There are also parts of URIs that are independent of URI schemes, in particular the fragment identifier (the part after the #). Fragment identifiers are separated from URIs before resolution, and applied to the resolved resource depending on its MIME type. The syntax of fragment identifiers is defined by the format used for the resource, e.g. HTML. A syntax for more flexible fragment identifiers is [XPointer], which is also defined to use UTF-8.
To address the problems with URI internationalization, other approaches were also considered initially. However, they were discarded years ago because each of them had severe problems. We discuss them here mainly to help understand why the UTF-8 solution was adopted.
Some proposals tried to extend the tagging approach used for MIME (e.g. email) headers and body parts, i.e. to indicate the encoding used for each URI or URI component. This was quickly rejected for a large number of reasons. Adding tags would lengthen the URI considerably. There would be confusion about whether the tagging indicated the encoding at the current place, or the encoding to be used when converting from characters to protocol bytes. Encoding tags would have to be added to URIs on paper, which would be highly counterintuitive. Also, URI resolvers and other software would have to know all encodings used.
One other proposal was to use (a variant of) UTF-7 instead of UTF-8. There would have been some slight advantage in length against the escaped UTF-8 form. However, it would be difficult or impossible to keep original US-ASCII characters and the characters produced by UTF-7 apart.
A convention that at some time enjoyed some popularity, and is actually produced or accepted by a few implementations, was to use %uHHHH (where HHHH is the four-digit hexadecimal number of the code unit in Unicode). The advantage would be that it is clear that %uHHHH is an encoding of a character based on Unicode. The problem is that it is new syntax that will easily confuse older implementations.
Converging on UTF-8 for the conversion between original characters and URIs is a very important step ahead, but still requires %HH-escaping. The obvious goal is to get rid of %HH-escaping whenever possible, and to just reach the same 'identity conversion' as for us-ascii. For this, the convergence to UTF-8 is an important prerequisite, because otherwise conversion to (traditional) URIs is not clearly defined.
The resulting construct has been called Internationalized URI and Globalized URI, but recently, we have adopted the term Internationalized Resource Identifier (IRI). The change of terminology made it quite a bit easier to describe the concepts, because it was possible to avoid lengthy terms such as non-internationalized URI. However, dropping the 'U' (uniform or universal) does not at all mean that these principles have been dropped; IRIs maintain these principles, and in some sense are actually more uniform and universal. It also does not mean that IRIs should be limited to very special places. IRIs can and should replace URIs wherever possible. The use of two clearly distinct terms makes it easier to describe this replacement in specifications. Whether the general public will ever adopt the term IRI is a different question.
In principle, the definition of IRI is very easy: It is the same as an URI, except that wherever %HH is allowed, non-URI characters are also allowed. As a result, using non-ASCII characters in IRIs becomes as easy and straightforward as using us-ascii characters in URIs. For convenience, the resolution of IRIs is defined via a conversion to URIs. However, this does not mean that an actual conversion to URIs is always needed. Conversion from IRIs to URIs is straightforward. All the characters not allowed in URIs are %HH-escaped after a conversion to bytes based on UTF-8. This is shown in Table 2.
Original characters map to bytes using some encoding; bytes map to an IRI using utf-8 or %HH, and to an URI using us-ascii or %HH.

| original | encoding | bytes | IRI | URI |
|----------|------------|----------------|--------|-----------|
| March | utf-8 | 4D 61 72 63 68 | March | March |
| März | utf-8 | 4D C3 A4 72 7A | März | M%C3%A4rz |
| März | iso-8859-1 | 4D E4 72 7A | M%E4rz | M%E4rz |
| März | macintosh | 4D 8A 72 7A | M%8Arz | M%8Arz |

Table 2: Original characters, IRIs, and URIs.
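The IRI-to-URI conversion just described can be sketched in a few lines of Python (a simplified illustration under the UTF-8 convention; it escapes every non-ASCII character and leaves all us-ascii syntax characters alone):

```python
from urllib.parse import quote

def iri_to_uri(iri: str) -> str:
    """Sketch of IRI-to-URI conversion: non-ASCII characters are
    encoded as UTF-8 and %HH-escaped; ASCII passes through untouched."""
    return ''.join(
        ch if ord(ch) < 0x80 else quote(ch.encode('utf-8'))
        for ch in iri)

print(iri_to_uri('http://www.example.org/März'))
# -> http://www.example.org/M%C3%A4rz
```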
Table 2 also shows that IRIs do not exclude URIs based on legacy encodings (last two rows). However, because these URIs do not use UTF-8, %HH-escaping has to be used. There are other cases where %HH-escaping has to be used or can be used; all together, there are the following cases:
- Even where escaping is possible, it is preferable to write März rather than M%C3%A4rz, because this conserves the identity of the character a-Umlaut.

While the basic idea of IRIs is extremely simple, there are some details that have to be addressed carefully. These include digital transport of IRIs, normalization issues, and conversion from URIs to IRIs.
Of course IRIs are designed to be transported digitally. One important detail which may be easy to overlook in this case is that IRIs, in the same way as URIs, are sequences of characters, not sequences of bytes. This is obvious when IRIs appear on paper or other physical transport media. For digital formats and protocols, it means that the character encoding of IRIs follows the character encoding conventions of the format or protocol in question. To take a particular example, an email body part or Web page encoded in iso-8859-1 would use the E4 byte to encode the character 'ä' in the IRI März, in the same way it uses this byte for all the other 'ä' characters. Conversion to UTF-8 only occurs when converting the IRI to an URI or when directly passing the IRI to the resolution API (assuming that API uses UTF-8 and not e.g. UTF-16).
Unicode allows multiple encoding variants for certain characters. For example, many characters with diacritics can be represented both in precomposed and in decomposed form. On paper and similar transport media, there is no difference. When converting from a form that does not make such distinctions to a form where these distinctions are relevant, a particular encoding must be chosen consistently. For many reasons outlined in [NormReq], the form to choose is Normalization Form C (NFC) [UTR15].
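In Python, for example, this normalization step is a one-liner; the standard unicodedata module implements the Unicode normalization forms:

```python
import unicodedata

decomposed = 'Ma\u0308rz'   # 'a' followed by COMBINING DIAERESIS
nfc = unicodedata.normalize('NFC', decomposed)

print(nfc == 'M\u00e4rz')   # True: NFC picks the precomposed 'ä'
print(len(decomposed), len(nfc))  # 5 4 -- one code point fewer after NFC
```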
While conversion from IRIs to URIs is quite straightforward, the reverse conversion is more difficult. The main problem is that it is not clear whether some %HH-escape sequence was the result of encoding characters using UTF-8 or whether it was produced otherwise. However, UTF-8 byte sequences are highly regular and very rarely coincide with byte sequences from legacy encodings (see [Dür97]). Still, just testing against UTF-8 byte patterns is not sufficient. Before converting back, various other conditions have to be checked. These include non-allocated code points, non-normalized code sequences, and characters not suitable in IRIs, such as formatting characters and various kinds of spaces and compatibility characters.
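A minimal sketch of the first two checks (valid UTF-8, NFC-normalized) looks as follows; a real converter would also reject non-allocated code points and unsuitable characters, and would leave escaped reserved characters such as %2F untouched, which this sketch glosses over:

```python
import unicodedata
from urllib.parse import unquote_to_bytes

def uri_to_iri(uri: str) -> str:
    """Unescape %HH sequences only if the resulting bytes are valid,
    NFC-normalized UTF-8; otherwise leave the URI untouched."""
    raw = unquote_to_bytes(uri)
    try:
        candidate = raw.decode('utf-8')
    except UnicodeDecodeError:
        return uri                    # not UTF-8: keep the escapes
    if not unicodedata.is_normalized('NFC', candidate):
        return uri                    # non-NFC sequences stay escaped
    return candidate

print(uri_to_iri('M%C3%A4rz'))        # März    -- valid UTF-8, converted
print(uri_to_iri('M%E4rz'))           # M%E4rz  -- not UTF-8, left alone
```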
Over the last years, a number of core Web specifications have adopted what now is called IRI. The most important ones are:

- [XLink] defines the xlink:href attribute that is used for all XLink links to be an IRI.
- [XMLSchema] defines its anyURI datatype as including IRI functionality. The 'any' prefix covers both the full spectrum of URI syntax including URI References as well as non-ASCII characters.

As [IRI] is still only a draft, these specifications include explicit definitions of IRI behavior. The following quote shows how this is done in [XLink]; other specifications contain very similar provisions.
Some characters are disallowed in URI references, even if they are allowed in XML; the disallowed characters include all non-ASCII characters, plus the excluded characters listed in Section 2.4 of [IETF RFC 2396], except for the number sign (#) and percent sign (%) and the square bracket characters re-allowed in [IETF RFC 2732]. Disallowed characters must be escaped as follows:
- Each disallowed character is converted to UTF-8 [IETF RFC 2279] as one or more bytes.
- Any bytes corresponding to a disallowed character are escaped with the URI escaping mechanism (that is, converted to %HH, where HH is the hexadecimal notation of the byte value).
- The original character is replaced by the resulting character sequence.
One remaining problem is the treatment of those us-ascii characters that are not allowed in URIs, i.e. space and various delimiters such as '<', '>', '{', '}', and so on. The current wording in [IRI] excludes them, but most of the specifications above actually allow them. It is expected that [IRI] will be updated to allow them, but will contain a strong warning against using them. The reasons for allowing them are that many actual implementations already tolerate them, and that in some contexts, in particular in XML attributes, many of them do not cause any problems, while other characters such as '&' actually have to be escaped (not with URI %HH-escaping, but with XML escaping such as &amp;). The reason for a strong warning is that IRIs that contain these characters cannot easily be transferred from one context to another.
There are two groups of conditions for the use of IRIs. The first group comprises basic operational conditions to deal with non-ASCII characters on the computer, and includes keyboard and display logic. The second group contains the two conditions necessary for IRIs to actually work in a particular context:
HTML Forms are an important part of interactivity and client-server communication on the Web. The most frequent way of sending information back to the server after filling in a form is by appending it to the request URI after a '?', in the so-called query part.
Internationalization of query parts is very important, because form character data naturally contains non-ascii characters. Unfortunately, because character encoding of URIs was not originally specified, various heuristics developed:
- Guessing the client's behavior based on the User-Agent: header field. This is very difficult, but sometimes the only solution.

Sending out a page in UTF-8 alleviates many problems with Web internationalization, including the problem of query parts. [XForms], the new generation of forms currently under development at W3C, provides hope for a final solution to all the above problems.
Many kinds of URIs can contain domain names (for example www.example.com). Currently, domain names only allow a very restricted set of characters, a subset of the characters allowed in URIs. The IETF IDN Working Group [IDN] is working on a solution to allow a large number of characters from Unicode in domain names. The solution that currently has the most support for use inside the domain name system itself is a mapping of Unicode characters back to ASCII characters, a so-called ACE (ASCII Compatible Encoding).
Independent of the solution chosen for the domain name system itself, uniformity of character representation is important for domain name components in URIs and IRIs, too. [IDN-URI] therefore proposes to adopt UTF-8 followed by %HH escaping for the domain name parts of URIs, which means that in IRIs, domain names from all kinds of scripts can be used naturally. [IDN-URI] also extends the generic URI syntax of [RFC2396] to allow %HH-escapes in the domain name part.
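The proposed representation can be illustrated with a non-ASCII host name (例え.example is a made-up example; whether resolvers accept the escaped form depends on [IDN-URI] being adopted):

```python
from urllib.parse import quote

# Hypothetical host name containing Japanese characters:
host = '\u4f8b\u3048.example'          # 例え.example

# [IDN-URI]-style representation: UTF-8 bytes, then %HH-escaping;
# the '.' label separator stays as-is.
print(quote(host.encode('utf-8')))
# -> %E4%BE%8B%E3%81%88.example
```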
Scripts such as Arabic and Hebrew are written from right to left. Combined with letters from other scripts or digits, this leads to the problem of mixed directionality or bidirectionality. Bidirectionality for URIs and IRIs is a difficult problem due to three contradictory requirements:
[Atkin] proposes a reordering algorithm for domain names that strikes a good balance between limitations on character combinations (e.g. both Latin and Arabic in the same component) and complexity. [IRI-bidi] explains how to use bidi control characters to bridge the gap between an IRI-specific algorithm and the Unicode algorithm. A combination of ideas from both proposals may lead to the best solution under the given boundary conditions.
The previous sections have shown how to move towards a consistent and user-friendly architecture for internationalized URIs. IRIs allow the use of a wide range of characters directly, with the same syntax as URIs. Conversion to URIs is done by encoding in UTF-8 and then using %HH-escaping as necessary. This fits together with the adoption of UTF-8 for more and more URI schemes.
Various experiences with W3C specifications over the last few years have shown that test suites can be a very efficient way to improve specifications and their implementation and deployment. We are therefore currently working on tests for IRIs. A first version is publicly available at http://www.w3.org/2001/08/iri-test. Tests will include documents conforming to different specifications (HTML, XML, XML Schema,...) in different encodings. Each test will test the functionality of a particular IRI in the document. Some tests will be added to make sure that the basic functionality for the test is available and that the test is executed correctly. These will include both tests using us-ascii only, as well as basic tests for each of the encodings used. A first version of the tests, only available to W3C members, already showed some encouraging results.
Other future work in particular includes moving the various specifications currently in draft stage further along the W3C or IETF specification process.
All opinions and errors in this paper are purely those of the author. There are many people who have to be acknowledged for URI internationalization, too many to list them all. The main thanks go to François Yergeau, who had the idea both for using UTF-8 and for how to address bidi problems, and to Larry Masinter, for providing both help as well as creative pushback.
RFCs are available at many locations, among other places at http://www.ietf.org/rfc/rfcNNNN.txt, where NNNN is the RFC number. Internet-Drafts are work in progress and are frequently updated. They can be found at http://www.ietf.org/internet-drafts/xxxx, where xxxx is the name of the draft. Please check whether there is a new version, with a higher sequence number (e.g. -08.txt in place of -07.txt).