19th International Unicode Conference, September 2001, San Jose, CA
Martin J. Dürst
W3C/Keio University
mailto:duerst@w3.org
http://www.w3.org/People/Dürst
Keywords: Uniform Resource Identifiers (URI), Internationalized Resource Identifiers (IRI), UTF-8
Uniform Resource Identifiers (URIs) are a core component of the Web. Internationalized Resource Identifiers (IRIs) are equivalent to URIs except that they remove the limitation that only a subset of us-ascii can be used. Conversion between IRIs and URIs is based on the UTF-8 character encoding followed by %-escaping. This matches well with an increasing number of URI schemes and components that use UTF-8 as their encoding. This paper discusses URI internationalization in detail, including motivation, architecture, specifications, and testing.
This section discusses the motivation for the internationalization of URIs and gives a basic introduction to URIs, their properties, and their components. Uniform Resource Identifiers (URIs) [RFC2396] are one of the three basic components of the original World Wide Web architecture (the other two being HTTP and HTML). URIs are the glue of the World Wide Web; they are used to identify virtually everything of importance, from Web pages and services to email addresses, telnet connections, and telephone calls.
Typically, URIs use a mixture of readable parts and syntax that is cryptic, at least at first glance. For example, this paper will be available at http://www.w3.org/2001/Talks/0912-IUC-IRI/paper.html. In this example, the reader may be able to correlate several of the components with the date of the talk, the conference name, the title, and so on. Some of these correlations may be wrong, or may go unnoticed. But the experience with URIs over the last few years, as well as with many other kinds of identifiers, shows that there is a continuing desire for people to make use of such correlations. In particular, such correlations are useful for the following purposes [Dür97a]:
All these operations are much easier if people can use their native script. This is a very clear motivation for making sure URIs are appropriately internationalized. In addition, URIs may contain query parts, where it is important that characters can be sent to the server reliably (see Section 4).
The basic property of URIs is that they are identifiers, i.e. they stand for something else. The 'something else' is called the resource, and the process of obtaining the resource is called resolution. URIs have a number of additional important properties. The properties most important in the context of this discussion are uniformity and transcribability. For completeness, this subsection also shortly discusses universality and the distinctions between URLs and URNs.
Uniformity refers to the fact that certain syntactic conventions are associated with certain operations for all URIs. As an example, the characters '#' or '/' always have the same function whenever they appear in an URI. This does not mean that every URI has a '/', or that all URI schemes allow '/', but it guarantees that the operations associated with '/' characters in URIs can be executed uniformly for all URIs. A thorough discussion of the importance of uniformity for the current and future operation of the Web can be found in [Gettys].
Uniformity was in some cases used as an argument against URI internationalization. Using a small and uniform set of characters would allow any URI to be used by anybody, on almost any type of device. However, many URIs are predominantly used by people knowing a particular script, and it is much better to optimize these URIs for these users than to optimize them for the remaining small minority that is not familiar with the script.
While the above discussion applies to the final form of URIs, uniformity is definitely very important when looking at character encoding issues. As Section 2 will show, this unfortunately has not been recognized from the start.
The second important property of URIs in the context of internationalization is that they are not only used inside digital systems as protocol elements, but also on paper and in people's minds. These kinds of transcriptions are important for URI internationalization in various ways. As noted above, they are one of the main motivations for internationalizing URIs to the point where a wide range of characters can be used directly.
URIs are also known as Universal Resource Identifiers. This refers to the fact that anything of importance can be given an URI, and any existing system of identifiers can be subsumed by URIs. It does not mean that URIs are the only such system possible, but it is currently the most visible, successful, and widely used one.
URIs are often partitioned into URLs (Uniform Resource Locators) and URNs (Uniform Resource Names). Internationalization of URIs is orthogonal to this distinction, and so only a very short summary is given here. Depending on the context of the discussion, the distinction is made in at least three ways. First, in an abstract sense, there is an attempt to distinguish between names and addresses (locations). This works very well for physical entities such as human beings or books in a library, but gets heavily blurred in the case of a digital network with numerous indirections and caching mechanisms. Second, in an intentional sense, URNs are often positioned for more persistent use. Third, in a syntactical sense, URNs are distinguished as those URIs that start with the prefix (scheme name) urn:.
URI syntax is defined so that various parts of an URI can be clearly identified if present. First, according to [RFC2396], what goes into the href attribute of the <a> element in HTML and similar places is called an URI Reference. This includes the part after the #, the so-called fragment identifier. For [RFC2396] and specifications referring to it, only the part before the # (if present) is actually called an URI. In everyday language, the term URI is often used for everything including the fragment identifier; this paper follows this practice, because internationalization considerations apply to the fragment identifier without exception.
In a well-defined context (e.g. in a Web page that has its own URI), it is possible to use relative URIs, which can be extremely short. [RFC2396] defines exactly how relative URIs can be converted into absolute URIs. Again, the distinction between relative and absolute URIs is not relevant for internationalization. Below, we will use very short examples, which can be understood to be relative URIs.
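The relative reference resolution defined in [RFC2396] is widely implemented; as an illustrative sketch, Python's standard library exposes it as urljoin (the file name slides.html below is a made-up example):

```python
from urllib.parse import urljoin

base = 'http://www.w3.org/2001/Talks/0912-IUC-IRI/paper.html'

# A relative URI is interpreted against the base URI of the document;
# the last path segment of the base is replaced:
print(urljoin(base, 'slides.html'))
# -> http://www.w3.org/2001/Talks/0912-IUC-IRI/slides.html
```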
The first part of an absolute URI, up to the first colon, is the scheme. Well-known schemes include http:, ftp:, and so on. The scheme defines both the syntax (within the general limits of the URI syntax) and the semantics of the URIs in this scheme, including character encoding.
The next section discusses character encoding in URIs, from the legacy of undefined character encoding towards the consistent use of UTF-8. Section 3 introduces IRIs as the internationalized equivalent of URIs. Section 4 deals with specific aspects such as query part internationalization, domain name internationalization, and bidirectionality. Section 5 discusses testing and future work.
This section discusses the evolution from legacy URI character handling to the use of UTF-8 for consistent URI character handling. For completeness, some other approaches to URI internationalization that have been proposed in the past are also discussed.
Older specifications for URIs [RFC1630] do not clearly distinguish between characters and bytes, and to some extent assume the use of iso-8859-1. With the quick growth of the Web beyond the area covered by iso-8859-1, this assumption became obsolete.
[RFC2396], the specification currently defining URIs, explains how characters get encoded into URIs in Section 2.1. A sequence of original characters (e.g. in a domain name or a file name) is mapped to a sequence of bytes. This sequence of bytes is then mapped to a sequence of URI characters. Both mappings can work both ways.
The second mapping, from bytes to URI characters, is well defined. For the bytes corresponding to a subset of us-ascii, the us-ascii encoding is used. This subset includes all letters and digits and a small number of symbols (called unreserved in [RFC2396]: '-', '_', '.', '!', '~', '*', "'", '(', ')'). For all the other bytes, a '%' followed by two hexadecimal digits is used. This is called %-encoding or %HH-encoding. The escaping also affects all the syntactically relevant characters such as '/', '#', '%', and so on. As an example, a simple % has to be escaped to %25 to clearly distinguish a 'payload' % character from a % used in a %-escape. It is also possible to escape additional bytes. As an example, an 'a' can always be escaped to %61, although this is done extremely rarely.
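This second mapping can be sketched with Python's standard library. By default, quote() already leaves letters, digits, '-', '_', '.', and '~' unescaped; passing the remaining [RFC2396] marks as safe reproduces the unreserved set described above:

```python
from urllib.parse import quote

# RFC 2396 marks that Python's quote() would otherwise escape:
MARKS = "!*'()"

def hh_escape(data: bytes) -> str:
    """Map a byte sequence to URI characters: unreserved bytes
    stay us-ascii, every other byte becomes %HH."""
    return quote(data, safe=MARKS)

print(hh_escape(b'March'))    # March   -- plain us-ascii passes through
print(hh_escape(b'M\xe4rz'))  # M%E4rz  -- the 0xE4 byte becomes an escape
print(hh_escape(b'100%'))     # 100%25  -- a payload '%' must be escaped
```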
Unfortunately, the first mapping, from original characters to bytes, is not well defined. For characters in the us-ascii range, the us-ascii encoding is used, but for other characters, [RFC2396] explicitly leaves the encoding undefined, and defers it to future specifications. The resulting situation is depicted in Table 1.
Both mappings work in both directions: the first mapping (original characters to bytes) has no defined encoding, while the second (bytes to URI characters) uses us-ascii or %HH.

| original characters | encoding | bytes | URI characters |
|---------------------|------------|----------------|----------------|
| March | us-ascii | 4D 61 72 63 68 | March |
| März | iso-8859-1 | 4D E4 72 7A | M%E4rz |
| März | macintosh | 4D 8A 72 7A | M%8Arz |
| März | utf-8 | 4D C3 A4 72 7A | M%C3%A4rz |

Table 1: Mapping between original characters and URI characters, with examples.
This shows clearly that there is a very strong asymmetry between the characters in the US-ASCII range and other characters. For characters in the US-ASCII range, the overall mapping is the identity. From protocol designers to end users, nobody is really aware that there are two mappings; the identity is taken for granted. For other characters, there is a double handicap: they get converted to unreadable escapes, and the encoding used gets lost. When trying to back-convert, one could for example end up with M‰rz or März.
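This encoding loss can be reproduced with Python's standard library: the same escaped URI decodes to different characters depending on which legacy encoding the recipient guesses.

```python
from urllib.parse import unquote

# The escaped URI 'M%E4rz' carries no hint of its original encoding:
print(unquote('M%E4rz', encoding='iso-8859-1'))  # März  (correct guess)
print(unquote('M%E4rz', encoding='mac_roman'))   # M‰rz  (wrong guess)
```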
Please note that there is no requirement that URIs are constructed from original characters. It is also possible to start directly with bytes in the case of digital data. However, the only known URI scheme that allows direct encoding of (binary) data, the data: scheme [RFC2397], uses base64 for better readability and shortness.
The above situation can be improved in two steps. The first step consists in converging on a single encoding. The second step consists in extending the number of characters allowed in URIs. The first step is described here. The second step is described in Section 3. Both steps are designed to be introduced in parallel and to reinforce each other.
Allowing arbitrary encodings for non-ASCII characters in URIs creates unnecessary confusion. Converging on a single encoding is highly desirable. UTF-8 is the encoding of choice for the following main reasons:
[RFC2396] leaves details of URI syntax, including the issue of character encoding, to scheme-specific definitions. Some of these in turn leave the choice of encoding to the individual creators of URIs. Given this situation, it was not possible to suddenly declare that UTF-8 should be the only encoding used in all URIs. [RFC2718], Section 2.2.5, however clearly recommends using UTF-8 for new URI schemes.
There are the following ways in which an URI scheme can adopt UTF-8:
There are also parts of URIs that are independent of URI schemes, in particular the fragment identifier (the part after the #). Fragment identifiers are separated from URIs before resolution, and applied to the resolved resource depending on its MIME type. The syntax of fragment identifiers is defined by the format used for the resource, e.g. HTML. A syntax for more flexible fragment identifiers is [XPointer], which is also defined to use UTF-8.
To address the problems with URI internationalization, other approaches were also considered initially. However, they were discarded years ago because each of them had severe problems. We discuss them here mainly to help understand why the UTF-8 solution was adopted.
Some proposals tried to extend the tagging approach used for MIME (e.g. email) headers and body parts, i.e. to indicate the encoding used for each URI or URI component. This was quickly rejected for a large number of reasons. Adding tags would lengthen the URI considerably. There would be confusion about whether the tagging indicated the encoding at the current place, or the encoding to be used when converting from characters to protocol bytes. Encoding tags would have to be added to URIs on paper, which would be highly counterintuitive. Also, URI resolvers and other software would have to know all encodings used.
One other proposal was to use (a variant of) UTF-7 instead of UTF-8. There would have been some slight advantage in length against the escaped UTF-8 form. However, it would be difficult or impossible to keep original US-ASCII characters and the characters produced by UTF-7 apart.
A convention that at some time enjoyed some popularity, and is actually produced or accepted by a few implementations, was to use %uHHHH (where HHHH is the four-digit hexadecimal number of the code unit in Unicode). The advantage would be that it is clear that %uHHHH is an encoding of a character based on Unicode. The problem is that it is new syntax that will easily confuse older implementations.
Converging on UTF-8 for the conversion between original characters and URIs is a very important step ahead, but still requires %HH-escaping. The obvious goal is to get rid of %HH-escaping whenever possible, and to just reach the same 'identity conversion' as for us-ascii. For this, the convergence to UTF-8 is an important prerequisite, because otherwise conversion to (traditional) URIs is not clearly defined.
The resulting construct has been called Internationalized URI and Globalized URI, but recently, we have adopted the term Internationalized Resource Identifier (IRI). The change of terminology made it quite a bit easier to describe the concepts, because it was possible to avoid lengthy terms such as non-internationalized URI. However, dropping the 'U' (uniform or universal) does not at all mean that these principles have been dropped; IRIs maintain these principles, and in some sense are actually more uniform and universal. It also does not mean that IRIs should be limited to very special places. IRIs can and should replace URIs wherever possible. The use of two clearly distinct terms makes it easier to describe this replacement in specifications. Whether the general public will ever adopt the term IRI is a different question.
In principle, the definition of IRI is very easy: It is the same as an URI, except that wherever %HH is allowed, non-URI characters are also allowed. As a result, using non-ASCII characters in IRIs becomes as easy and straightforward as using us-ascii characters in URIs. For convenience, the resolution of IRIs is defined via a conversion to URIs. However, this does not mean that an actual conversion to URIs is always needed. Conversion from IRIs to URIs is straightforward. All the characters not allowed in URIs are %HH-escaped after a conversion to bytes based on UTF-8. This is shown in Table 2.
Original characters map to bytes using some encoding; bytes map to an IRI using utf-8 or %HH, and to an URI using us-ascii or %HH.

| original | encoding | bytes | IRI | URI |
|----------|------------|----------------|--------|-----------|
| March | utf-8 | 4D 61 72 63 68 | March | March |
| März | utf-8 | 4D C3 A4 72 7A | März | M%C3%A4rz |
| März | iso-8859-1 | 4D E4 72 7A | M%E4rz | M%E4rz |
| März | macintosh | 4D 8A 72 7A | M%8Arz | M%8Arz |

Table 2: Original characters, IRIs, and URIs.
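The IRI-to-URI conversion just described can be sketched in a few lines of Python (a simplified illustration under the UTF-8 convention; it escapes every non-ASCII character and leaves all us-ascii syntax characters alone):

```python
from urllib.parse import quote

def iri_to_uri(iri: str) -> str:
    """Sketch of IRI-to-URI conversion: non-ASCII characters are
    encoded as UTF-8 and %HH-escaped; ASCII passes through untouched."""
    return ''.join(
        ch if ord(ch) < 0x80 else quote(ch.encode('utf-8'))
        for ch in iri)

print(iri_to_uri('http://www.example.org/März'))
# -> http://www.example.org/M%C3%A4rz
```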
Table 2 also shows that IRIs do not exclude URIs based on legacy encodings (last two rows). However, because these URIs do not use UTF-8, %HH-escaping has to be used. There are other cases where %HH-escaping has to be used or can be used; all together, there are the following cases:
- Even where escaping is possible, it is preferable to write März rather than M%C3%A4rz, because this conserves the identity of the character a-Umlaut.

While the basic idea of IRIs is extremely simple, there are some details that have to be addressed carefully. These include digital transport of IRIs, normalization issues, and conversion from URIs to IRIs.
Of course IRIs are designed to be transported digitally. One important detail which may be easy to overlook in this case is that IRIs, in the same way as URIs, are sequences of characters, not sequences of bytes. This is obvious when IRIs appear on paper or other physical transport media. For digital formats and protocols, it means that the character encoding of IRIs follows the character encoding conventions of the format or protocol in question. To take a particular example, an email body part or Web page encoded in iso-8859-1 would use the E4 byte to encode the character 'ä' in the IRI März, in the same way it uses this byte for all the other 'ä' characters. Conversion to UTF-8 only occurs when converting the IRI to an URI or when directly passing the IRI to the resolution API (assuming that API uses UTF-8 and not e.g. UTF-16).
Unicode allows multiple encoding variants for certain characters. For example, many characters with diacritics can be represented both in precomposed and in decomposed form. On paper and similar transport media, there is no difference. When converting from a form that does not make such distinctions to a form where these distinctions are relevant, a particular encoding must be chosen consistently. For many reasons outlined in [NormReq], the form to choose is Normalization Form C (NFC) [UTR15].
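In Python, for example, this normalization step is a one-liner; the standard unicodedata module implements the Unicode normalization forms:

```python
import unicodedata

decomposed = 'Ma\u0308rz'   # 'a' followed by COMBINING DIAERESIS
nfc = unicodedata.normalize('NFC', decomposed)

print(nfc == 'M\u00e4rz')   # True: NFC picks the precomposed 'ä'
print(len(decomposed), len(nfc))  # 5 4 -- one code point fewer after NFC
```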
While conversion from IRIs to URIs is quite straightforward, the reverse conversion is more difficult. The main problem is that it is not clear whether some %HH-escape sequence was the result of encoding characters using UTF-8 or whether it was produced otherwise. However, UTF-8 byte sequences are highly regular and very rarely coincide with byte sequences from legacy encodings (see [Dür97]). Still, just testing against UTF-8 byte patterns is not sufficient. Before converting back, various other conditions have to be checked. These include non-allocated code points, non-normalized code sequences, and characters not suitable in IRIs, such as formatting characters and various kinds of spaces and compatibility characters.
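A minimal sketch of the first two checks (valid UTF-8, NFC-normalized) looks as follows; a real converter would also reject non-allocated code points and unsuitable characters, and would leave escaped reserved characters such as %2F untouched, which this sketch glosses over:

```python
import unicodedata
from urllib.parse import unquote_to_bytes

def uri_to_iri(uri: str) -> str:
    """Unescape %HH sequences only if the resulting bytes are valid,
    NFC-normalized UTF-8; otherwise leave the URI untouched."""
    raw = unquote_to_bytes(uri)
    try:
        candidate = raw.decode('utf-8')
    except UnicodeDecodeError:
        return uri                    # not UTF-8: keep the escapes
    if not unicodedata.is_normalized('NFC', candidate):
        return uri                    # non-NFC sequences stay escaped
    return candidate

print(uri_to_iri('M%C3%A4rz'))        # März    -- valid UTF-8, converted
print(uri_to_iri('M%E4rz'))           # M%E4rz  -- not UTF-8, left alone
```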
Over the last years, a number of core Web specifications have adopted what now is called IRI. The most important ones are:

- [XLink] defines the xlink:href attribute that is used for all XLink links to be an IRI.
- [XMLSchema] defines its anyURI datatype as including IRI functionality. The 'any' prefix covers both the full spectrum of URI syntax including URI References as well as non-ASCII characters.

As [IRI] is still only a draft, these specifications include explicit definitions of IRI behavior. The following quote shows how this is done in [XLink]; other specifications contain very similar provisions.
Some characters are disallowed in URI references, even if they are allowed in XML; the disallowed characters include all non-ASCII characters, plus the excluded characters listed in Section 2.4 of [IETF RFC 2396], except for the number sign (#) and percent sign (%) and the square bracket characters re-allowed in [IETF RFC 2732]. Disallowed characters must be escaped as follows:
- Each disallowed character is converted to UTF-8 [IETF RFC 2279] as one or more bytes.
- Any bytes corresponding to a disallowed character are escaped with the URI escaping mechanism (that is, converted to %HH, where HH is the hexadecimal notation of the byte value).
- The original character is replaced by the resulting character sequence.
One remaining problem is the treatment of those us-ascii characters that are not allowed in URIs, i.e. space and various delimiters such as '<', '>', '{', '}', and so on. The current wording in [IRI] excludes them, but most of the specifications above actually allow them. It is expected that [IRI] will be updated to allow them, but will contain a strong warning against using them. The reasons for allowing them are that many actual implementations already tolerate them, and that in some contexts, in particular in XML attributes, many of them do not cause any problems, while other characters such as '&' actually have to be escaped (not with URI %HH-escaping, but with XML escaping such as &amp;). The reason for a strong warning is that IRIs that contain these characters cannot easily be transferred from one context to another.
There are two groups of conditions for the use of IRIs. The first group comprises basic operational conditions to deal with non-ASCII characters on the computer, and includes keyboard and display logic. The second group contains the two conditions necessary for IRIs to actually work in a particular context:
HTML Forms are an important part of interactivity and client-server communication on the Web. The most frequent way of sending information back to the server after filling in a form is by appending it to the request URI after a '?', in the so-called query part.
Internationalization of query parts is very important, because form character data naturally contains non-ascii characters. Unfortunately, because character encoding of URIs was not originally specified, various heuristics developed:
- Guessing the client's behavior based on the User-Agent: header field. This is very difficult, but sometimes the only solution.

Sending out a page in UTF-8 alleviates many problems with Web internationalization, including the problem of query parts. [XForms], the new generation of forms currently under development at W3C, provides hope for a final solution to all the above problems.
Many kinds of URIs can contain domain names (for example www.example.com). Currently, domain names only allow a very restricted set of characters, a subset of the characters allowed in URIs. The IETF IDN Working Group [IDN] is working on a solution to allow a large number of characters from Unicode in domain names. The solution that currently has the most support for use inside the domain name system itself is a mapping of Unicode characters back to ASCII characters, a so-called ACE (ASCII Compatible Encoding).
Independent of the solution chosen for the domain name system itself, uniformity of character representation is important for domain name components in URIs and IRIs, too. [IDN-URI] therefore proposes to adopt UTF-8 followed by %HH escaping for the domain name parts of URIs, which means that in IRIs, domain names from all kinds of scripts can be used naturally. [IDN-URI] also extends the generic URI syntax of [RFC2396] to allow %HH-escapes in the domain name part.
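The proposed representation can be illustrated with a non-ASCII host name (例え.example is a made-up example; whether resolvers accept the escaped form depends on [IDN-URI] being adopted):

```python
from urllib.parse import quote

# Hypothetical host name containing Japanese characters:
host = '\u4f8b\u3048.example'          # 例え.example

# [IDN-URI]-style representation: UTF-8 bytes, then %HH-escaping;
# the '.' label separator stays as-is.
print(quote(host.encode('utf-8')))
# -> %E4%BE%8B%E3%81%88.example
```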
Scripts such as Arabic and Hebrew are written from right to left. Combined with letters from other scripts or digits, this leads to the problem of mixed directionality or bidirectionality. Bidirectionality for URIs and IRIs is a difficult problem due to three contradictory requirements:
[Atkin] proposes a reordering algorithm for domain names that strikes a good balance between limitations on character combinations (e.g. both Latin and Arabic in the same component) and complexity. [IRI-bidi] explains how to use bidi control characters to bridge the gap between an IRI-specific algorithm and the Unicode algorithm. A combination of ideas from both proposals may lead to the best solution under the given boundary conditions.
The previous sections have shown how to move towards a consistent and user-friendly architecture for internationalized URIs. IRIs allow the use of a wide range of characters directly, with the same syntax as URIs. Conversion to URIs is done by encoding in UTF-8 and then using %HH-escaping as necessary. This fits together with the adoption of UTF-8 for more and more URI schemes.
Various experiences with W3C specifications over the last few years have shown that test suites can be a very efficient way to improve specifications and their implementation and deployment. We are therefore currently working on tests for IRIs. A first version is publicly available at http://www.w3.org/2001/08/iri-test. Tests will include documents conforming to different specifications (HTML, XML, XML Schema,...) in different encodings. Each test will test the functionality of a particular IRI in the document. Some tests will be added to make sure that the basic functionality for the test is available and that the test is executed correctly. These will include both tests using us-ascii only, as well as basic tests for each of the encodings used. A first version of the tests, only available to W3C members, already showed some encouraging results.
Other future work in particular includes moving the various specifications currently in draft stage further along the W3C or IETF specification process.
All opinions and errors in this paper are purely those of the author. There are many people who have to be acknowledged for URI internationalization, too many to list them all. The main thanks go to François Yergeau, who had the idea both for using UTF-8 and for how to address bidi problems, and to Larry Masinter, for providing both help as well as creative pushback.
RFCs are available at many locations, among other places at http://www.ietf.org/rfc/rfcNNNN.txt, where NNNN is the RFC number. Internet-Drafts are work in progress and are frequently updated. They can be found at http://www.ietf.org/internet-drafts/xxxx, where xxxx is the name of the draft. Please check whether there is a new version, with a higher sequence number (e.g. -08.txt in place of -07.txt).