draft-duerst-iri-10.txt   draft-duerst-iri.txt 
Network Working Group M. Duerst Network Working Group M. Duerst
Internet-Draft W3C Internet-Draft W3C
Expires: March 28, 2005 M. Suignard Expires: May 31, 2005 M. Suignard
Microsoft Corporation Microsoft Corporation
September 27, 2004 November 30, 2004
Internationalized Resource Identifiers (IRIs) Internationalized Resource Identifiers (IRIs)
draft-duerst-iri-10 draft-duerst-iri-11
Status of this Memo Status of this Memo
This document is an Internet-Draft and is subject to all provisions This document is an Internet-Draft and is subject to all provisions
of section 3 of RFC 3667. By submitting this Internet-Draft, each of section 3 of RFC 3667. By submitting this Internet-Draft, each
author represents that any applicable patent or other IPR claims of author represents that any applicable patent or other IPR claims of
which he or she is aware have been or will be disclosed, and any of which he or she is aware have been or will be disclosed, and any of
which he or she become aware will be disclosed, in accordance with which he or she become aware will be disclosed, in accordance with
RFC 3668. RFC 3668.
skipping to change at page 1, line 37 skipping to change at page 1, line 37
and may be updated, replaced, or obsoleted by other documents at any and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress." material or to cite them other than as "work in progress."
The list of current Internet-Drafts can be accessed at The list of current Internet-Drafts can be accessed at
http://www.ietf.org/ietf/1id-abstracts.txt. http://www.ietf.org/ietf/1id-abstracts.txt.
The list of Internet-Draft Shadow Directories can be accessed at The list of Internet-Draft Shadow Directories can be accessed at
http://www.ietf.org/shadow.html. http://www.ietf.org/shadow.html.
This Internet-Draft will expire on March 28, 2005. This Internet-Draft will expire on May 31, 2005.
Copyright Notice Copyright Notice
Copyright (C) The Internet Society (2004). Copyright (C) The Internet Society (2004).
Abstract Abstract
This document defines a new protocol element, the Internationalized This document defines a new protocol element, the Internationalized
Resource Identifier (IRI), as a complement to the Uniform Resource Resource Identifier (IRI), as a complement to the Uniform Resource
Identifier (URI). An IRI is a sequence of characters from the Identifier (URI). An IRI is a sequence of characters from the
skipping to change at page 2, line 16 skipping to change at page 2, line 16
of extending or changing the definition of URIs, to allow a clear of extending or changing the definition of URIs, to allow a clear
distinction and to avoid incompatibilities with existing software. distinction and to avoid incompatibilities with existing software.
Guidelines for the use and deployment of IRIs in various protocols, Guidelines for the use and deployment of IRIs in various protocols,
formats, and software components that now deal with URIs are formats, and software components that now deal with URIs are
provided. provided.
Table of Contents Table of Contents
1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 4 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.1 Overview and Motivation . . . . . . . . . . . . . . . . . 4 1.1 Overview and Motivation . . . . . . . . . . . . . . . . . 4
1.2 Applicability . . . . . . . . . . . . . . . . . . . . . . 5 1.2 Applicability . . . . . . . . . . . . . . . . . . . . . . 4
1.3 Definitions . . . . . . . . . . . . . . . . . . . . . . . 5 1.3 Definitions . . . . . . . . . . . . . . . . . . . . . . . 5
1.4 Notation . . . . . . . . . . . . . . . . . . . . . . . . . 6 1.4 Notation . . . . . . . . . . . . . . . . . . . . . . . . . 6
2. IRI Syntax . . . . . . . . . . . . . . . . . . . . . . . . . . 7 2. IRI Syntax . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.1 Summary of IRI Syntax . . . . . . . . . . . . . . . . . . 7 2.1 Summary of IRI Syntax . . . . . . . . . . . . . . . . . . 7
2.2 ABNF for IRI References and IRIs . . . . . . . . . . . . . 8 2.2 ABNF for IRI References and IRIs . . . . . . . . . . . . . 8
3. Relationship between IRIs and URIs . . . . . . . . . . . . . . 11 3. Relationship between IRIs and URIs . . . . . . . . . . . . . . 10
3.1 Mapping of IRIs to URIs . . . . . . . . . . . . . . . . . 11 3.1 Mapping of IRIs to URIs . . . . . . . . . . . . . . . . . 11
3.2 Converting URIs to IRIs . . . . . . . . . . . . . . . . . 14 3.2 Converting URIs to IRIs . . . . . . . . . . . . . . . . . 14
3.2.1 Examples . . . . . . . . . . . . . . . . . . . . . . . 16 3.2.1 Examples . . . . . . . . . . . . . . . . . . . . . . . 15
4. Bidirectional IRIs for Right-to-left Languages . . . . . . . . 17 4. Bidirectional IRIs for Right-to-left Languages . . . . . . . . 17
4.1 Logical Storage and Visual Presentation . . . . . . . . . 17 4.1 Logical Storage and Visual Presentation . . . . . . . . . 17
4.2 Bidi IRI Structure . . . . . . . . . . . . . . . . . . . . 19 4.2 Bidi IRI Structure . . . . . . . . . . . . . . . . . . . . 18
4.3 Input of Bidi IRIs . . . . . . . . . . . . . . . . . . . . 20 4.3 Input of Bidi IRIs . . . . . . . . . . . . . . . . . . . . 20
4.4 Examples . . . . . . . . . . . . . . . . . . . . . . . . . 20 4.4 Examples . . . . . . . . . . . . . . . . . . . . . . . . . 20
5. IRI Equivalence and Comparison . . . . . . . . . . . . . . . . 22 5. Normalization and Comparison . . . . . . . . . . . . . . . . . 22
5.1 Simple String Comparison . . . . . . . . . . . . . . . . . 22 5.1 Equivalence . . . . . . . . . . . . . . . . . . . . . . . 22
5.2 Conversion to URIs . . . . . . . . . . . . . . . . . . . . 23 5.2 Preparation for Comparison . . . . . . . . . . . . . . . . 23
5.3 Normalization . . . . . . . . . . . . . . . . . . . . . . 23 5.3 Comparison Ladder . . . . . . . . . . . . . . . . . . . . 23
5.4 Preferred Forms . . . . . . . . . . . . . . . . . . . . . 24 5.3.1 Simple String Comparison . . . . . . . . . . . . . . . 24
6. Use of IRIs . . . . . . . . . . . . . . . . . . . . . . . . . 25 5.3.2 Syntax-based Normalization . . . . . . . . . . . . . . 25
6.1 Limitations on UCS Characters Allowed in IRIs . . . . . . 25 5.3.3 Scheme-based Normalization . . . . . . . . . . . . . . 27
6.2 Software Interfaces and Protocols . . . . . . . . . . . . 25 5.3.4 Protocol-based Normalization . . . . . . . . . . . . . 29
6.3 Format of URIs and IRIs in Documents and Protocols . . . . 26 6. Use of IRIs . . . . . . . . . . . . . . . . . . . . . . . . . 29
6.4 Use of UTF-8 for Encoding Original Characters . . . . . . 26 6.1 Limitations on UCS Characters Allowed in IRIs . . . . . . 29
6.5 Relative IRI References . . . . . . . . . . . . . . . . . 28 6.2 Software Interfaces and Protocols . . . . . . . . . . . . 30
7. URI/IRI Processing Guidelines (informative) . . . . . . . . . 28 6.3 Format of URIs and IRIs in Documents and Protocols . . . . 30
7.1 URI/IRI Software Interfaces . . . . . . . . . . . . . . . 28 6.4 Use of UTF-8 for Encoding Original Characters . . . . . . 30
7.2 URI/IRI Entry . . . . . . . . . . . . . . . . . . . . . . 28 6.5 Relative IRI References . . . . . . . . . . . . . . . . . 32
7.3 URI/IRI Transfer Between Applications . . . . . . . . . . 29 7. URI/IRI Processing Guidelines (informative) . . . . . . . . . 32
7.4 URI/IRI Generation . . . . . . . . . . . . . . . . . . . . 30 7.1 URI/IRI Software Interfaces . . . . . . . . . . . . . . . 32
7.5 URI/IRI Selection . . . . . . . . . . . . . . . . . . . . 30 7.2 URI/IRI Entry . . . . . . . . . . . . . . . . . . . . . . 33
7.6 Display of URIs/IRIs . . . . . . . . . . . . . . . . . . . 31 7.3 URI/IRI Transfer Between Applications . . . . . . . . . . 34
7.7 Interpretation of URIs and IRIs . . . . . . . . . . . . . 31 7.4 URI/IRI Generation . . . . . . . . . . . . . . . . . . . . 34
7.8 Upgrading Strategy . . . . . . . . . . . . . . . . . . . . 32 7.5 URI/IRI Selection . . . . . . . . . . . . . . . . . . . . 35
8. Security Considerations . . . . . . . . . . . . . . . . . . . 33 7.6 Display of URIs/IRIs . . . . . . . . . . . . . . . . . . . 35
9. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 34 7.7 Interpretation of URIs and IRIs . . . . . . . . . . . . . 36
10. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . 34 7.8 Upgrading Strategy . . . . . . . . . . . . . . . . . . . . 36
11. References . . . . . . . . . . . . . . . . . . . . . . . . . 35 8. Security Considerations . . . . . . . . . . . . . . . . . . . 37
11.1 Normative References . . . . . . . . . . . . . . . . . . . . 35 9. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 39
11.2 Non-normative References . . . . . . . . . . . . . . . . . . 36 10. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . 39
Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . 38 11. References . . . . . . . . . . . . . . . . . . . . . . . . . 39
A. Design Alternatives . . . . . . . . . . . . . . . . . . . . . 39 11.1 Normative References . . . . . . . . . . . . . . . . . . . . 39
A.1 New Scheme(s) . . . . . . . . . . . . . . . . . . . . . . 39 11.2 Non-normative References . . . . . . . . . . . . . . . . . . 41
A.2 Other Character Encodings than UTF-8 . . . . . . . . . . . 40 Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . 43
A.3 New Encoding Convention . . . . . . . . . . . . . . . . . 40 A. Design Alternatives . . . . . . . . . . . . . . . . . . . . . 43
A.4 Indicating Character Encodings in the URI/IRI . . . . . . 40 A.1 New Scheme(s) . . . . . . . . . . . . . . . . . . . . . . 43
Intellectual Property and Copyright Statements . . . . . . . . 41 A.2 Other Character Encodings than UTF-8 . . . . . . . . . . . 44
A.3 New Encoding Convention . . . . . . . . . . . . . . . . . 44
A.4 Indicating Character Encodings in the URI/IRI . . . . . . 44
Intellectual Property and Copyright Statements . . . . . . . . 45
1. Introduction 1. Introduction
1.1 Overview and Motivation 1.1 Overview and Motivation
A Uniform Resource Identifier (URI) is defined in [RFCYYYY] as a A Uniform Resource Identifier (URI) is defined in [RFCYYYY] as a
sequence of characters chosen from a limited subset of the repertoire sequence of characters chosen from a limited subset of the repertoire
of US-ASCII [ASCII] characters. of US-ASCII [ASCII] characters.
The characters in URIs are frequently used for representing words of The characters in URIs are frequently used for representing words of
skipping to change at page 4, line 45 skipping to change at page 4, line 45
[RFCYYYY], such as URI references. The syntax of IRIs is defined in [RFCYYYY], such as URI references. The syntax of IRIs is defined in
Section 2, and the relationship between IRIs and URIs in Section 3. Section 2, and the relationship between IRIs and URIs in Section 3.
Using characters outside of A-Z in IRIs brings with it some Using characters outside of A-Z in IRIs brings with it some
difficulties. Section 4 discusses the special case of bidirectional difficulties. Section 4 discusses the special case of bidirectional
IRIs, Section 5 various forms of equivalence between IRIs, and IRIs, Section 5 various forms of equivalence between IRIs, and
Section 6 the use of IRIs in different situations. Section 7 gives Section 6 the use of IRIs in different situations. Section 7 gives
additional informative guidelines, and Section 8 security additional informative guidelines, and Section 8 security
considerations. considerations.
For discussion of this document, please use the public-iri@w3.org
mailing list (publicly archived at
http://lists.w3.org/Archives/Public/public-iri/). An issues list for
this document is maintained at
http://www.w3.org/International/iri-edit#issues. For more
information on the topic of this document, please also see [W3CIRI]
and [Duerst01].
1.2 Applicability 1.2 Applicability
IRIs are designed to be compatible with recommendations for new URI IRIs are designed to be compatible with recommendations for new URI
schemes [RFC2718]. The compatibility is provided by specifying a schemes [RFC2718]. The compatibility is provided by specifying a
well defined and deterministic mapping from the IRI character well defined and deterministic mapping from the IRI character
sequence to the functionally equivalent URI character sequence. sequence to the functionally equivalent URI character sequence.
Practical use of IRIs (or IRI references) in place of URIs (or URI Practical use of IRIs (or IRI references) in place of URIs (or URI
references) depends on the following conditions being met: references) depends on the following conditions being met:
a) The protocol or format element where IRIs are used should be a) The protocol or format element where IRIs are used should be
skipping to change at page 11, line 5 skipping to change at page 10, line 44
/ "25" %x30-35 ; 250-255 / "25" %x30-35 ; 250-255
pct-encoded = "%" HEXDIG HEXDIG pct-encoded = "%" HEXDIG HEXDIG
unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~" unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
reserved = gen-delims / sub-delims reserved = gen-delims / sub-delims
gen-delims = ":" / "/" / "?" / "#" / "[" / "]" / "@" gen-delims = ":" / "/" / "?" / "#" / "[" / "]" / "@"
sub-delims = "!" / "$" / "&" / "'" / "(" / ")" sub-delims = "!" / "$" / "&" / "'" / "(" / ")"
/ "*" / "+" / "," / ";" / "=" / "*" / "+" / "," / ";" / "="
This syntax does not support IPv6 scoped addressing zone identifiers.
3. Relationship between IRIs and URIs 3. Relationship between IRIs and URIs
IRIs are meant to replace URIs in identifying resources for IRIs are meant to replace URIs in identifying resources for
protocols, formats and software components which use a UCS-based protocols, formats and software components which use a UCS-based
character repertoire. These protocols and components may never need character repertoire. These protocols and components may never need
to use URIs directly, especially when the resource identifier is used to use URIs directly, especially when the resource identifier is used
simply for identification purposes. However, when the resource simply for identification purposes. However, when the resource
identifier is used for resource retrieval, it is in many cases identifier is used for resource retrieval, it is in many cases
necessary to determine the associated URI because most retrieval necessary to determine the associated URI because most retrieval
mechanisms currently only are defined for URIs. In this case, IRIs mechanisms currently only are defined for URIs. In this case, IRIs
skipping to change at page 12, line 12 skipping to change at page 12, line 7
characters from the UCS normalized according to Normalization characters from the UCS normalized according to Normalization
Form C (NFC, [UTR15]). Form C (NFC, [UTR15]).
Variant B) If the IRI is in some digital representation (e.g. an Variant B) If the IRI is in some digital representation (e.g. an
octet stream) in some known non-Unicode character encoding: octet stream) in some known non-Unicode character encoding:
Convert the IRI to a sequence of characters from the UCS Convert the IRI to a sequence of characters from the UCS
normalized according to NFC. normalized according to NFC.
Variant C) If the IRI is in a Unicode-based character encoding Variant C) If the IRI is in a Unicode-based character encoding
(for example UTF-8 or UTF-16): Do not normalize (see Section (for example UTF-8 or UTF-16): Do not normalize (see Section
5.3 for details). Apply Step 2 directly to the encoded Unicode 5.3.2.2 for details). Apply Step 2 directly to the encoded
character sequence. Unicode character sequence.
Step 2) For each character in 'ucschar' or 'iprivate', apply Steps Step 2) For each character in 'ucschar' or 'iprivate', apply Steps
2.1 through 2.3 below. 2.1 through 2.3 below.
2.1) Convert the character to a sequence of one or more octets 2.1) Convert the character to a sequence of one or more octets
using UTF-8 [RFC3629]. using UTF-8 [RFC3629].
2.2) Convert each octet to %HH, where HH is the hexadecimal 2.2) Convert each octet to %HH, where HH is the hexadecimal
notation of the octet value. Note that this is identical to notation of the octet value. Note that this is identical to
the percent-encoding mechanism in Section 2.1 of [RFCYYYY]. To the percent-encoding mechanism in Section 2.1 of [RFCYYYY]. To
skipping to change at page 22, line 14 skipping to change at page 22, line 8
Depending on whether the upper-case letters represent Arabic or Depending on whether the upper-case letters represent Arabic or
Hebrew, the visual representation is different. Hebrew, the visual representation is different.
Example 10 (allowed, but not recommended): Example 10 (allowed, but not recommended):
logical representation: http://ab.CDEFGH.123/kl/mn/op.html logical representation: http://ab.CDEFGH.123/kl/mn/op.html
visual representation: http://ab.123.HGFEDC/kl/mn/op.html visual representation: http://ab.123.HGFEDC/kl/mn/op.html
Components consisting of only numbers are allowed (it would be rather Components consisting of only numbers are allowed (it would be rather
difficult to prohibit them), but may interact with adjacent RTL difficult to prohibit them), but may interact with adjacent RTL
components in ways that are not easy to predict. components in ways that are not easy to predict.
5. IRI Equivalence and Comparison 5. Normalization and Comparison
This section discusses IRI Equivalence and Comparison similar to Note: The structure and much of the material for this section is
Section 6, "Normalization and Comparison", in [RFCYYYY]. This taken from section 6 of [RFCYYYY]; the differences are due to the
section focuses on the main issues and on aspects that are different specifics of IRIs.
from [RFCYYYY]; Section 6 of [RFCYYYY] is recommended background
reading.
There is no general rule or procedure to decide whether two arbitrary One of the most common operations on IRIs is simple comparison:
IRIs are equivalent or not (i.e. whether they refer to the same determining if two IRIs are equivalent without using the IRIs or the
resource or not). Two IRIs that look almost the same may refer to mapped URIs to access their respective resource(s). A comparison is
different resources. Two IRIs that look completely different may performed every time a response cache is accessed, a browser checks
refer to the same resource. Each specification or application that its history to color a link, or an XML parser processes tags within a
uses IRIs has to decide on the appropriate criterion for IRI namespace. Extensive normalization prior to comparison of IRIs may
equivalence. be used by spiders and indexing engines to prune a search space or
reduce duplication of request actions and response storage.
5.1 Simple String Comparison IRI comparison is performed with respect to some particular purpose,
and implementations with differing purposes will often be subject to
differing design trade-offs in regard to how much effort should be
spent in reducing aliased identifiers. This section describes a
variety of methods that may be used to compare IRIs, the trade-offs
between them, and the types of applications that might use them.
In some scenarios a definite answer to the question of IRI 5.1 Equivalence
equivalence is needed that is independent of the scheme used and
always can be calculated quickly and without accessing a network. An
example of such a case is XML Namespaces ([XMLNamespace]). In such
cases, two IRIs SHOULD be defined as equivalent if and only if they
are character-by-character equivalent. This is the same as being
byte-by-byte equivalent if the character encoding for both IRIs is
the same. As an example,
http://example.org/~user, http://example.org/%7euser, and
http://example.org/%7Euser are not equivalent under this definition.
When comparing character-by-character, the comparison function MUST
NOT map IRIs to URIs, because such a mapping would create additional
spurious equivalences.
It follows that IRIs SHOULD NOT be modified when being transported if Since IRIs exist to identify resources, presumably they should be
there is any chance that this IRI might be used as an identifier in considered equivalent when they identify the same resource. However,
the way explained above. When an IRI is used as an identifier in such a definition of equivalence is not of much practical use, since
scenarios that depend upon character-by-character equivalence, there is no way for an implementation to compare two resources that
creators of IRIs should take additional care to avoid IRIs that only are not under its own control. For this reason, determination of
differ in their use of percent-escaping. As an example, using both equivalence or difference of IRIs is based on string comparison,
http://example.org/~user and http://example.org/%7Euser to identify perhaps augmented by reference to additional rules provided by URI
XML Namespaces is a bad idea. scheme definitions. We use the terms "different" and "equivalent" to
describe the possible outcomes of such comparisons, but there are
many application-dependent versions of equivalence.
5.2 Conversion to URIs Even though it is possible to determine that two IRIs are equivalent,
IRI comparison is not sufficient to determine if two IRIs identify
different resources. For example, an owner of two different domain
names could decide to serve the same resource from both, resulting in
two different IRIs. Therefore, comparison methods are designed to
minimize false negatives while strictly avoiding false positives.
For actual resolution, differences in percent-encoding (except for In testing for equivalence, applications should not directly compare
the percent-encoding of reserved characters) MUST always result in relative references; the references should be converted to their
the same resource. For example, http://example.org/~user, respective target IRIs before comparison. When IRIs are being
http://example.org/%7euser and http://example.org/%7Euser must compared for the purpose of selecting (or avoiding) a network action,
resolve to the same resource. such as retrieval of a representation, fragment components (if any)
should be excluded from the comparison.
If this kind of equivalence is to be tested, the percent-encoding of Applications using IRIs as identity tokens with no relationship to a
both IRIs to be compared has to be aligned, for example by converting protocol MUST use the Simple String Comparison (see Section 5.3.1).
both IRIs to URIs (see Section 3.1), eliminating escape differences All other applications MUST select one of the comparison practices
in the resulting URIs, and making sure that the case of the from the Comparison Ladder (see Section 5.3), or, after IRI-to-URI
hexadecimal characters in the percent-encoding is always the same conversion, select one of the comparison practices from the URI
(preferably upper case). If the IRI is to be passed to another comparison ladder [RFCYYYY], Section 6.2.
application, or used further in some other way, its original form
MUST be preserved; the conversion described here should be performed
only for the purpose of local comparison.
Additional, similar equivalences are possible based on knowledge 5.2 Preparation for Comparison
about the generic URI/IRI syntax, such as the fact that the scheme
part is case-insensitive.
5.3 Normalization Any kind of IRI comparison REQUIRES that all escapings or encodings
in the protocol or format that carries an IRI are resolved. This is
usually done when parsing the protocol or format. Examples of such
escapings or encodings are entities and numeric character references
in [HTML4] and [XML1]. As an example, http://example.org/ros&eacute;
(in HTML), http://example.org/ros&#xE9; (in HTML or XML), and
http://example.org/ros&#233; (in HTML or XML) all get resolved into
what is denoted in this document (see Section 1.4) as
http://example.org/ros&#xE9; (the "&#xE9;" here standing for the
actual e-acute character, to compensate for the fact that this
document cannot contain non-ASCII characters).
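
As an illustration of resolving format-level escapes before comparison,
the following minimal Python sketch uses the standard library function
html.unescape to resolve HTML/XML character references. The variable
names and the choice of library are assumptions made for this example
only; an actual HTML or XML parser would normally perform this step.

```python
import html

# Three ways the same IRI might be written inside an HTML or XML document.
carried = [
    "http://example.org/ros&eacute;",  # named entity (HTML only)
    "http://example.org/ros&#xE9;",    # hexadecimal character reference
    "http://example.org/ros&#233;",    # decimal character reference
]

# Resolve the escapes of the carrying format; only then compare the IRIs.
resolved = [html.unescape(text) for text in carried]

# All three resolve to the same character sequence, ending in U+00E9.
all_same = len(set(resolved)) == 1
```

Comparing the unresolved strings directly would report three different
identifiers, although the carrying format defines them to denote the
same IRI.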
Similar considerations apply to encodings such as Transfer Codings in
HTTP (see [RFC2616]) and Content Transfer Encodings in MIME [RFC2045],
although in these cases, the encoding is not based on characters, but
on octets, and additional care is required to make sure that
characters, and not just arbitrary octets, are compared (see Section
5.3.1).
5.3 Comparison Ladder
A variety of methods are used in practice to test IRI equivalence.
These methods fall into a range, distinguished by the amount of
processing required and the degree to which the probability of false
negatives is reduced. As noted above, false negatives cannot be
eliminated. In practice, their probability can be reduced, but this
reduction requires more processing and is not cost-effective for all
applications.
If this range of comparison practices is considered as a ladder, the
following discussion will climb the ladder, starting with those
practices that are cheap but have a relatively higher chance of
producing false negatives, and proceeding to those that have higher
computational cost and lower risk of false negatives.
5.3.1 Simple String Comparison
If two IRIs, considered as character strings, are identical, then it
is safe to conclude that they are equivalent. This type of
equivalence test has very low computational cost and is in wide use
in a variety of applications, particularly in the domain of parsing
and when a definitive answer to the question of IRI equivalence is
needed that is independent of the scheme used and can be calculated
quickly and without accessing a network. An example of such a case
is XML Namespaces ([XMLNamespace]).
Testing strings for equivalence requires some basic precautions.
This procedure is often referred to as "bit-for-bit" or
"byte-for-byte" comparison, which is potentially misleading. Testing
of strings for equality is normally based on pairwise comparison of
the characters that make up the strings, starting from the first and
proceeding until both strings are exhausted and all characters found
to be equal, a pair of characters compares unequal, or one of the
strings is exhausted before the other.
Such character comparisons require that each pair of characters be
put in comparable encoding form. For example, should one IRI be
stored in a byte array in UTF-8 encoding form, and the second be in a
UTF-16 encoding form, bit-for-bit comparisons applied naively will
produce errors. It is better to speak of equality on a
character-for-character rather than byte-for-byte or bit-for-bit
basis. In practical terms, character-by-character comparisons should
be done codepoint-by-codepoint after conversion to a common character
encoding form. When comparing character-by-character, the comparison
function MUST NOT map IRIs to URIs, because such a mapping would
create additional spurious equivalences. It follows that IRIs SHOULD
NOT be modified when being transported if there is any chance that
this IRI might be used as an identifier.
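
The difference between byte-for-byte and character-for-character
comparison can be sketched in Python as follows. The function name
same_iri and the sample values are hypothetical; the sketch assumes the
character encoding of each octet sequence is known to the caller.

```python
def same_iri(a_bytes, a_encoding, b_bytes, b_encoding):
    """Simple string comparison: decode both octet sequences to
    characters, then compare codepoint by codepoint.  No IRI-to-URI
    mapping and no normalization is applied."""
    return a_bytes.decode(a_encoding) == b_bytes.decode(b_encoding)

iri = "http://example.org/ros\u00e9"
as_utf8 = iri.encode("utf-8")
as_utf16 = iri.encode("utf-16-le")

# The stored octets differ, so a naive byte comparison reports a mismatch.
naive_equal = as_utf8 == as_utf16

# After decoding, the character sequences are identical.
char_equal = same_iri(as_utf8, "utf-8", as_utf16, "utf-16-le")

# Differences in percent-encoding are NOT hidden at this level.
pct_equal = same_iri(b"http://example.org/~user", "utf-8",
                     b"http://example.org/%7Euser", "utf-8")
```

Python compares str values by codepoint, which is why decoding to str
is enough here; other languages may need an explicit conversion to a
common encoding form.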
False negatives are caused by the production and use of IRI aliases.
Unnecessary aliases can be reduced, regardless of the comparison
method, by consistently providing IRI references in an
already-normalized form (i.e., a form identical to what would be
produced after normalization is applied, as described below).
Protocols and data formats often choose to limit some IRI comparisons
to simple string comparison, based on the theory that people and
implementations will, in their own best interest, be consistent in
providing IRI references, or at least consistent enough to negate any
efficiency that might be obtained from further normalization.
5.3.2 Syntax-based Normalization
Implementations may use logic based on the definitions provided by
this specification to reduce the probability of false negatives.
Such processing is moderately higher in cost than
character-for-character string comparison. For example, an
application using this approach could reasonably consider the
following two IRIs equivalent:
example://a/b/c/%7Bfoo%7D/ros&#xE9;
eXAMPLE://a/./b/../b/%63/%7bfoo%7d/ros%C3%A9
Web user agents, such as browsers, typically apply this type of IRI
normalization when determining whether a cached response is
available. Syntax-based normalization includes such techniques as
case normalization, character normalization, percent-encoding
normalization, and removal of dot-segments.
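
The following Python sketch shows one possible way to make the two
example IRIs above compare equal. It is a simplified illustration under
several assumptions, not a complete implementation: it compares in URI
form (non-ASCII characters percent-encoded as UTF-8), handles only
IRIs of the form scheme://host/path, lowercases the host as if it were
US-ASCII only, and uses a reduced dot-segment removal that ignores
trailing "." and ".." subtleties. All function names are hypothetical.

```python
import re

UNRESERVED = set("ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"
                 "0123456789-._~")

def to_uri_chars(text):
    """Percent-encode each non-ASCII character as its UTF-8 octets."""
    out = []
    for ch in text:
        if ord(ch) < 128:
            out.append(ch)
        else:
            out.extend("%{:02X}".format(b) for b in ch.encode("utf-8"))
    return "".join(out)

def normalize_pct(text):
    """Decode percent-encoded unreserved characters; uppercase the
    hexadecimal digits of all remaining percent-encoded triplets."""
    def fix(match):
        ch = chr(int(match.group(1), 16))
        return ch if ch in UNRESERVED else "%" + match.group(1).upper()
    return re.sub(r"%([0-9A-Fa-f]{2})", fix, text)

def remove_dot_segments(path):
    """Simplified removal of "." and ".." segments from an absolute path."""
    out = []
    for segment in path.split("/"):
        if segment == ".":
            continue
        if segment == "..":
            if len(out) > 1:   # never pop the leading empty segment
                out.pop()
            continue
        out.append(segment)
    return "/".join(out)

def normalize(iri):
    m = re.match(r"^([A-Za-z][A-Za-z0-9+.\-]*)://([^/?#]*)([^?#]*)(.*)$", iri)
    if not m:
        return iri             # outside the scope of this sketch
    scheme, host, path, rest = m.groups()
    path = remove_dot_segments(normalize_pct(to_uri_chars(path)))
    rest = normalize_pct(to_uri_chars(rest))
    return scheme.lower() + "://" + host.lower() + path + rest

first = normalize("example://a/b/c/%7Bfoo%7D/ros\u00e9")
second = normalize("eXAMPLE://a/./b/../b/%63/%7bfoo%7d/ros%C3%A9")
```

Both calls yield example://a/b/c/%7Bfoo%7D/ros%C3%A9. The normalized
form is meant for local comparison only; the original IRI is what
should be stored or passed on.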
5.3.2.1 Case Normalization
For all IRIs, the hexadecimal digits within a percent-encoding
triplet (e.g., "%3a" versus "%3A") are case-insensitive and therefore
should be normalized to use uppercase letters for the digits A-F.
When an IRI uses components of the generic syntax, the component
syntax equivalence rules always apply; namely, that the scheme and
US-ASCII only host are case-insensitive and therefore should be
normalized to lowercase. For example, the URI
<HTTP://www.EXAMPLE.com/> is equivalent to <http://www.example.com/>.
Case equivalence for non-ASCII characters in IRI components that are
IDNs is discussed in Section 5.3.3. The other generic syntax
components are assumed to be case-sensitive unless specifically
defined otherwise by the scheme.
Creating schemes that allow case-insensitive syntax components
containing non US-ASCII characters should be avoided because such a
case normalization may be culturally dependent and is always a complex
operation. The only exception concerns non-ASCII host names for
which the character normalization includes a mapping step derived
from case folding.
5.3.2.2 Character Normalization
The Unicode Standard [UNIV4] defines various equivalences between The Unicode Standard [UNIV4] defines various equivalences between
sequences of characters for various purposes. Unicode Standard Annex sequences of characters for various purposes. Unicode Standard Annex
#15 [UTR15] defines various Normalization Forms for these #15 [UTR15] defines various Normalization Forms for these
equivalences, in particular Normalization Form C (NFC, Canonical equivalences, in particular Normalization Form C (NFC, Canonical
Decomposition, followed by Canonical Composition) and Normalization Decomposition, followed by Canonical Composition) and Normalization
Form KC (NFKC, Compatibility Decomposition, followed by Canonical Form KC (NFKC, Compatibility Decomposition, followed by Canonical
Composition). Composition).
Equivalence of IRIs MUST rely on the assumption that IRIs are Equivalence of IRIs MUST rely on the assumption that IRIs are
appropriately pre-normalized, rather than applying normalization when appropriately pre-character-normalized, rather than applying
comparing two IRIs. The exceptions are conversion from a non-digital character normalization when comparing two IRIs. The exceptions are
form, and conversion from a non-UCS-based character encoding to an conversion from a non-digital form, and conversion from a
UCS-based character encoding. In these cases, NFC or a normalizing non-UCS-based character encoding to an UCS-based character encoding.
transcoder using NFC MUST be used for interoperability. To avoid In these cases, NFC or a normalizing transcoder using NFC MUST be
false negatives and problems with transcoding, IRIs SHOULD be created used for interoperability. To avoid false negatives and problems
using NFC. Using NFKC may avoid even more problems, for example by with transcoding, IRIs SHOULD be created using NFC. Using NFKC may
choosing half-width Latin letters instead of full-width, and avoid even more problems, for example by choosing half-width Latin
full-width Katakana instead of half-width. letters instead of full-width, and full-width Katakana instead of
half-width.
As an example, http://www.example.org/r&#xE9;sum&#xE9;.html (in XML As an example, http://www.example.org/r&#xE9;sum&#xE9;.html (in XML
Notation) is in NFC. On the other hand, Notation) is in NFC. On the other hand,
http://www.example.org/re&#x301;sume&#x301;.html is not in NFC. The http://www.example.org/re&#x301;sume&#x301;.html is not in NFC. The
former uses precombined e-acute characters, the latter uses 'e' former uses precombined e-acute characters, the latter uses 'e'
characters followed by combining acute accents. Both usages are characters followed by combining acute accents. Both usages are
defined to be canonically equivalent in [UNIV4]. defined to be canonically equivalent in [UNIV4].
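The two spellings in the example above can be checked mechanically. The following is a minimal sketch, assuming Python's standard unicodedata module as the normalizer; the helper name is_nfc is illustrative and not part of this specification. It is intended for use when an IRI is created, not for normalizing IRIs received from third parties.

```python
import unicodedata

# The two example IRIs from the text, written with explicit escapes.
precomposed = "http://www.example.org/r\u00e9sum\u00e9.html"   # e-acute as one character
decomposed = "http://www.example.org/re\u0301sume\u0301.html"  # 'e' + combining acute

def is_nfc(iri: str) -> bool:
    """Return True if the string is already in Normalization Form C."""
    return unicodedata.normalize("NFC", iri) == iri

print(is_nfc(precomposed))  # True
print(is_nfc(decomposed))   # False

# Canonical equivalence: normalizing the decomposed form yields the NFC form.
print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True

# NFKC additionally folds compatibility variants: a full-width Latin 'A'
# (U+FF21) becomes 'A', and a half-width Katakana KA (U+FF76) becomes U+30AB.
folded = unicodedata.normalize("NFKC", "\uff21\uff76")
```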
Note: Because it is unknown how a particular field is being treated Note: Because it is unknown how a particular sequence of characters
with respect to text normalization, it would be inappropriate to is being treated with respect to character normalization, it would
allow third parties to normalize an IRI arbitrarily. This does be inappropriate to allow third parties to normalize an IRI
not contradict the recommendation that when a resource is created, arbitrarily. This does not contradict the recommendation that
its IRI should be as normalized as possible (i.e. NFC or even when a resource is created, its IRI should be as
NFKC). This is similar to the upper-case/lower-case problems in character-normalized as possible (i.e. NFC or even NFKC). This
URIs. Some parts of a URI are case-insensitive (domain name). is similar to the upper-case/lower-case problems in
For others, it is unclear whether they are case-sensitive or URIs.
Some parts of a URI are case-insensitive (domain name). For
others, it is unclear whether they are case-sensitive or
case-insensitive, or something in between (e.g. case-sensitive, case-insensitive, or something in between (e.g. case-sensitive,
but if the wrong case is used, a multiple choice selection is but if the wrong case is used, a multiple choice selection is
provided instead of a direct negative result). The best recipe is provided instead of a direct negative result). The best recipe is
that the creator uses a reasonable capitalization, and when that the creator uses a reasonable capitalization, and when
transferring the URI, that capitalization is never changed. transferring the URI, that capitalization is never changed.
Various IRI schemes may allow the usage of International Domain Names Various IRI schemes may allow the usage of Internationalized Domain
(IDN) [RFC3490]. When in use in IRIs, those names SHOULD be Names (IDN) [RFC3490] either in the ireg-name part or elsewhere.
validated using the ToASCII operation defined in [RFC3490], with the Character Normalization also applies to IDNs, as discussed in Section
flags "UseSTD3ASCIIRules" and "AllowUnassigned". An IRI containing 5.3.3.
an invalid IDN cannot successfully be resolved. For legibility
purposes, IDN components of IRIs SHOULD NOT be converted into ASCII
Compatible Encoding (ACE).
5.4 Preferred Forms 5.3.2.3 Percent-Encoding Normalization
The following are the preferred forms for IRIs when created: The percent-encoding mechanism (Section 2.1 of [RFCYYYY]) is a
frequent source of variance among otherwise identical IRIs. In
addition to the case normalization issue noted above, some IRI
producers percent-encode octets that do not require percent-encoding,
resulting in IRIs that are equivalent to their nonencoded
counterparts. Such IRIs should be normalized by decoding any
percent-encoded octet sequence that corresponds to an unreserved
character, as described in Section 2.3 of [RFCYYYY].
- Always provide the URI scheme in lowercase characters. For actual resolution, differences in percent-encoding (except for
the percent-encoding of reserved characters) MUST always result in
the same resource. For example, http://example.org/~user,
http://example.org/%7euser and http://example.org/%7Euser must
resolve to the same resource.
- Only perform percent-encoding where it is essential. If this kind of equivalence is to be tested, the percent-encoding of
both IRIs to be compared has to be aligned, for example by converting
both IRIs to URIs (see Section 3.1), eliminating escape differences
in the resulting URIs, and making sure that the case of the
hexadecimal characters in the percent-encoding is always the same
(preferably upper case). If the IRI is to be passed to another
application, or used further in some other way, its original form
MUST be preserved; the conversion described here should be performed
only for the purpose of local comparison.
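The alignment step described above can be sketched as follows. This is an illustrative sketch only, assuming the input has already been converted to a URI (Section 3.1) and assuming the unreserved set is ALPHA, DIGIT, "-", ".", "_" and "~"; the function name normalize_percent_encoding is hypothetical. The result is meant for local comparison and does not replace the original form.

```python
import re

# Assumed unreserved set: ALPHA / DIGIT / "-" / "." / "_" / "~"
UNRESERVED = frozenset(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789-._~"
)
_PCT = re.compile(r"%([0-9A-Fa-f]{2})")

def normalize_percent_encoding(uri: str) -> str:
    """Decode percent-encoded unreserved characters; uppercase all other triplets."""
    def fix(match: re.Match) -> str:
        char = chr(int(match.group(1), 16))
        if char in UNRESERVED:
            return char                        # e.g. %7E or %7e becomes ~
        return "%" + match.group(1).upper()    # e.g. %2f becomes %2F
    return _PCT.sub(fix, uri)

# The three example URIs from the text compare equal after alignment.
variants = [
    "http://example.org/~user",
    "http://example.org/%7euser",
    "http://example.org/%7Euser",
]
aligned = {normalize_percent_encoding(u) for u in variants}
print(aligned)
```

Reserved characters such as "/" stay encoded, since decoding them could change how the URI is parsed.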
- Always use uppercase A-through-F characters when percent-encoding. 5.3.2.4 Path Segment Normalization
- For those schemes where ireg-name is a domain name, always provide The complete path segments "." and ".." are intended only for use
the individual labels, in the form produced when applying nameprep within relative references (Section 4.1 of [RFCYYYY]) and are removed
[RFC3491]. This in particular includes using lowercase characters as part of the reference resolution process (Section 5.2 of
rather than uppercase characters where applicable. Also, always [RFCYYYY]). However, some implementations may incorrectly assume
use US-ASCII '.' as a separator. that reference resolution is not necessary when the reference is
already an IRI, and thus fail to remove dot-segments when they occur
in non-relative paths. IRI normalizers should remove dot-segments by
applying the remove_dot_segments algorithm to the path, as described
in Section 5.2.4 of [RFCYYYY].
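The dot-segment removal can be sketched as follows. This sketch assumes the algorithm in the form published in RFC 3986, Section 5.2.4 (the RFC that resulted from the referenced URI draft); it should be checked against [RFCYYYY] before being relied upon.

```python
def remove_dot_segments(path: str) -> str:
    """Remove "." and ".." segments from a path (input buffer / output list)."""
    inp = path
    out = []
    while inp:
        if inp.startswith("../"):
            inp = inp[3:]
        elif inp.startswith("./"):
            inp = inp[2:]
        elif inp.startswith("/./"):
            inp = inp[2:]             # replace "/./" with "/"
        elif inp == "/.":
            inp = "/"
        elif inp.startswith("/../"):
            inp = inp[3:]             # replace "/../" with "/"
            if out:
                out.pop()             # drop the last output segment
        elif inp == "/..":
            inp = "/"
            if out:
                out.pop()
        elif inp in (".", ".."):
            inp = ""
        else:
            # Move the first segment (with its leading "/", if any) to the output.
            start = 1 if inp.startswith("/") else 0
            cut = inp.find("/", start)
            if cut == -1:
                out.append(inp)
                inp = ""
            else:
                out.append(inp[:cut])
                inp = inp[cut:]
    return "".join(out)

print(remove_dot_segments("/a/b/c/./../../g"))  # /a/g
```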
- Where possible, provide IRI components in NFKC or NFC. 5.3.3 Scheme-based Normalization
- Prevent /./ and /../ from appearing in IRI paths. The syntax and semantics of IRIs vary from scheme to scheme, as
described by the defining specification for each scheme.
Implementations may use scheme-specific rules, at further processing
cost, to reduce the probability of false negatives. For example,
since the "http" scheme makes use of an authority component, has a
default port of "80", and defines an empty path to be equivalent to
"/", the following four IRIs are equivalent:
- For schemes that define an empty path to be equivalent to a path http://example.com
of "/", use "/". http://example.com/
http://example.com:/
http://example.com:80/
In general, an IRI that uses the generic syntax for authority with an
empty path should be normalized to a path of "/"; likewise, an
explicit ":port", where the port is empty or the default for the
scheme, is equivalent to one where the port and its ":" delimiter are
elided, and thus should be removed by scheme-based normalization.
For example, the second IRI above is the normal form for the "http"
scheme.
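The "http" rules just described can be sketched as follows. This is a minimal illustration, not a complete normalizer: the DEFAULT_PORTS table and the function name normalize_scheme_based are hypothetical, and the component-splitting regular expression is an assumption modeled on the generic syntax. Components are split by hand so that an empty query or fragment delimiter is carried through unchanged.

```python
import re

# Splits scheme, authority, path, query (with "?") and fragment (with "#").
_IRI_RE = re.compile(
    r"^(?:([^:/?#]+):)?(?://([^/?#]*))?([^?#]*)(\?[^#]*)?(#.*)?$", re.DOTALL
)

# Hypothetical table; only "http" is taken from the text.
DEFAULT_PORTS = {"http": "80"}

def normalize_scheme_based(iri: str) -> str:
    scheme, authority, path, query, fragment = _IRI_RE.match(iri).groups()
    if scheme is None or authority is None or scheme.lower() not in DEFAULT_PORTS:
        return iri                    # unknown scheme or no authority: leave as-is
    scheme = scheme.lower()
    userinfo, at, hostport = authority.rpartition("@")
    host, colon, port = hostport.rpartition(":")
    if colon and port in ("", DEFAULT_PORTS[scheme]):
        hostport = host               # drop an empty or default ":port"
    return f"{scheme}://{userinfo}{at}{hostport}{path or '/'}{query or ''}{fragment or ''}"

examples = [
    "http://example.com",
    "http://example.com/",
    "http://example.com:/",
    "http://example.com:80/",
]
print({normalize_scheme_based(e) for e in examples})
```

An IRI such as "http://example.com/?" keeps its "?" delimiter under this sketch, so it does not collapse into the normal form of the four examples.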
Another case where normalization varies by scheme is in the handling
of an empty authority component or empty host subcomponent. For many
scheme specifications, an empty authority or host is considered an
error; for others, it is considered equivalent to "localhost" or the
end-user's host. When a scheme defines a default for authority and
an IRI reference to that default is desired, the reference should be
normalized to an empty authority for the sake of uniformity, brevity,
and internationalization. If, however, either the userinfo or port
subcomponent is non-empty, then the host should be given explicitly
even if it matches the default.
Normalization should not remove delimiters when their associated
component is empty unless licensed to do so by the scheme
specification. For example, the IRI "http://example.com/?" cannot be
assumed to be equivalent to any of the examples above. Likewise, the
presence or absence of delimiters within a userinfo subcomponent is
usually significant to its interpretation. The fragment component is
not subject to any scheme-based normalization; thus, two IRIs that
differ only by the suffix "#" are considered different regardless of
the scheme.
Some IRI schemes may allow the usage of Internationalized Domain
Names (IDN) [RFC3490] either in their ireg-name part or elsewhere.
When in use in IRIs, those names SHOULD be validated using the
ToASCII operation defined in [RFC3490], with the flags
"UseSTD3ASCIIRules" and "AllowUnassigned". An IRI containing an
invalid IDN cannot successfully be resolved. Validated IDN
components of IRIs SHOULD be character normalized using the Nameprep
process [RFC3491]; however, for legibility purposes, they SHOULD NOT
be converted into ASCII Compatible Encoding (ACE).
Scheme-based normalization may also consider IDN components and their
conversions to punycode as equivalent. As an example,
http://r&#xE9;sum&#xE9;.example.org may be considered equivalent to
http://xn--rsum-bpad.example.org
Other scheme-specific normalizations are possible.
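The IDN equivalence above can be illustrated with a short sketch. It assumes Python's built-in "idna" codec, which implements the ToASCII and ToUnicode operations of [RFC3490] but does not expose the "UseSTD3ASCIIRules" and "AllowUnassigned" flags, so it only approximates the validation described here. The helper names are hypothetical. The ACE form is computed for comparison only; the IRI itself is left unconverted.

```python
def host_to_ace(host: str) -> str:
    """Return the ASCII Compatible Encoding of a host name, for comparison only."""
    # ASCII-only labels pass through the codec unchanged, so fold case afterwards.
    return host.encode("idna").decode("ascii").lower()

def idn_hosts_equivalent(a: str, b: str) -> bool:
    """Treat an IDN and its punycode form as equivalent host names."""
    return host_to_ace(a) == host_to_ace(b)

print(host_to_ace("r\u00e9sum\u00e9.example.org"))
print(idn_hosts_equivalent("r\u00e9sum\u00e9.example.org",
                           "xn--rsum-bpad.example.org"))
```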
5.3.4 Protocol-based Normalization
Web spiders, for which substantial effort to reduce the incidence of
false negatives is often cost-effective, are observed to implement
even more aggressive techniques in IRI comparison. For example, if
they observe that an IRI such as
http://example.com/data
redirects to an IRI differing only in the trailing slash
http://example.com/data/
they will likely regard the two as equivalent in the future. This
kind of technique is only appropriate when equivalence is clearly
indicated by both the result of accessing the resources and the
common conventions of their scheme's dereference algorithm (in this
case, use of redirection by HTTP origin servers to avoid problems
with relative references).
6. Use of IRIs 6. Use of IRIs
6.1 Limitations on UCS Characters Allowed in IRIs 6.1 Limitations on UCS Characters Allowed in IRIs
This section discusses limitations on characters and character This section discusses limitations on characters and character
sequences usable for IRIs beyond those given in Section 2.2 and sequences usable for IRIs beyond those given in Section 2.2 and
Section 4.1. The considerations in this section are relevant when Section 4.1. The considerations in this section are relevant when
creating IRIs and when converting from URIs to IRIs. creating IRIs and when converting from URIs to IRIs.
skipping to change at page 35, line 15 skipping to change at page 39, line 31
The discussion on the issue addressed here has started a long time The discussion on the issue addressed here has started a long time
ago. There was a thread in the HTML working group in August 1995 ago. There was a thread in the HTML working group in August 1995
(under the topic of "Globalizing URIs") and in the www-international (under the topic of "Globalizing URIs") and in the www-international
mailing list in July 1996 (under the topic of "Internationalization mailing list in July 1996 (under the topic of "Internationalization
and URLs"), and ad-hoc meetings at the Unicode conferences in and URLs"), and ad-hoc meetings at the Unicode conferences in
September 1995 and September 1997. September 1995 and September 1997.
Many thanks go to Francois Yergeau, Matitiahu Allouche, Roy Fielding, Many thanks go to Francois Yergeau, Matitiahu Allouche, Roy Fielding,
Tim Berners-Lee, Mark Davis, M.T. Carrasco Benitez, James Clark, Tim Tim Berners-Lee, Mark Davis, M.T. Carrasco Benitez, James Clark, Tim
Bray, Chris Wendt, Yaron Goland, Andrea Vine, Misha Wolf, Leslie Bray, Chris Wendt, Yaron Goland, Andrea Vine, Misha Wolf, Leslie
Daigle, Ted Hardie, Makoto MURATA, Steven Atkin, Ryan Stansifer, Tex Daigle, Ted Hardie, Bill Fenner, Margaret Wasserman, Russ Housley,
Texin, Graham Klyne, Bjoern Hoehrmann, Chris Lilley, Ian Jacobs, Adam Makoto MURATA, Steven Atkin, Ryan Stansifer, Tex Texin, Graham Klyne,
Costello, Dan Oscarson, Elliotte Rusty Harold, Mike J. Brown, Roy Bjoern Hoehrmann, Chris Lilley, Ian Jacobs, Adam Costello, Dan
Badami, Jonathan Rosenne, Asmus Freytag, Simon Josefsson, Carlos Oscarson, Elliotte Rusty Harold, Mike J. Brown, Roy Badami, Jonathan
Viegas Damasio, Chris Haynes, Walter Underwood, and many others for Rosenne, Asmus Freytag, Simon Josefsson, Carlos Viegas Damasio, Chris
help with understanding the issues and possible solutions, and Haynes, Walter Underwood, and many others for help with understanding
getting the details right. the issues and possible solutions, and getting the details right.
This document is a product of the Internationalization Working Group This document is a product of the Internationalization Working Group
(I18N WG) of the World Wide Web Consortium (W3C). Thanks to the (I18N WG) of the World Wide Web Consortium (W3C). Thanks to the
members of the W3C I18N Working Group and Interest Group for their members of the W3C I18N Working Group and Interest Group for their
contributions and their work on [CharMod]. Thanks also go to the contributions and their work on [CharMod]. Thanks also go to the
members of many other W3C Working Groups for adopting IRIs, and to members of many other W3C Working Groups for adopting IRIs, and to
the members of the Montreal IAB Workshop on Internationalization and the members of the Montreal IAB Workshop on Internationalization and
Localization for their review. Localization for their review.
11. References 11. References
skipping to change at page 36, line 17 skipping to change at page 40, line 34
Profile for Internationalized Domain Names (IDN)", RFC Profile for Internationalized Domain Names (IDN)", RFC
3491, March 2003. 3491, March 2003.
[RFC3629] Yergeau, F., "UTF-8, a transformation format of ISO [RFC3629] Yergeau, F., "UTF-8, a transformation format of ISO
10646", STD 63, RFC 3629, November 2003. 10646", STD 63, RFC 3629, November 2003.
[RFCYYYY] Berners-Lee, T., Fielding, R. and L. Masinter, "Uniform [RFCYYYY] Berners-Lee, T., Fielding, R. and L. Masinter, "Uniform
Resource Identifier (URI): Generic Syntax (Note to the RFC Resource Identifier (URI): Generic Syntax (Note to the RFC
Editor: Please update this reference with the RFC Editor: Please update this reference with the RFC
resulting from draft-fielding-uri-rfc2396bis-xx.txt, and resulting from draft-fielding-uri-rfc2396bis-xx.txt, and
remove this Note)", draft-fielding-uri-rfc2396bis-07.txt remove this Note)", draft-fielding-uri-rfc2396bis-07 (work
(work in progress), April 2004. in progress), April 2004.
[UNI9] Davis, M., "The Bidirectional Algorithm", Unicode Standard [UNI9] Davis, M., "The Bidirectional Algorithm", Unicode Standard
Annex #9, March 2004, Annex #9, March 2004,
<http://www.unicode.org/reports/tr9/tr9-13.html>. <http://www.unicode.org/reports/tr9/tr9-13.html>.
[UNIV4] The Unicode Consortium, "The Unicode Standard, Version [UNIV4] The Unicode Consortium, "The Unicode Standard, Version
4.0.1, defined by: The Unicode Standard, Version 4.0 4.0.1, defined by: The Unicode Standard, Version 4.0
(Reading, MA, Addison-Wesley, 2003. ISBN 0-321-18578-1), (Reading, MA, Addison-Wesley, 2003. ISBN 0-321-18578-1),
as amended by Unicode 4.0.1 as amended by Unicode 4.0.1
(http://www.unicode.org/versions/Unicode4.0.1/)", March (http://www.unicode.org/versions/Unicode4.0.1/)", March
skipping to change at page 36, line 46 skipping to change at page 41, line 15
11.2 Non-normative References 11.2 Non-normative References
[BidiEx] "Examples of bidirectional IRIs", [BidiEx] "Examples of bidirectional IRIs",
<http://www.w3.org/International/iri-edit/BidiExamples>. <http://www.w3.org/International/iri-edit/BidiExamples>.
[CharMod] Duerst, M., Yergeau, F., Ishida, R., Wolf, M. and T. [CharMod] Duerst, M., Yergeau, F., Ishida, R., Wolf, M. and T.
Texin, "Character Model for the World Wide Web", World Texin, "Character Model for the World Wide Web", World
Wide Web Consortium Working Draft, February 2004, Wide Web Consortium Working Draft, February 2004,
<http://www.w3.org/TR/charmod>. <http://www.w3.org/TR/charmod>.
[Duerst01]
Duerst, M., "Internationalized Resource Identifiers: From
Specification to Testing", Proc. 19th International
Unicode Conference, San Jose , September 2001,
<http://www.w3.org/2001/Talks/0912-IUC-IRI/paper.html>.
[Duerst97] [Duerst97]
Duerst, M., "The Properties and Promises of UTF-8", Proc. Duerst, M., "The Properties and Promises of UTF-8", Proc.
11th International Unicode Conference, San Jose , 11th International Unicode Conference, San Jose ,
September 1997, September 1997,
<http://www.ifi.unizh.ch/mml/mduerst/papers/PDF/ <http://www.ifi.unizh.ch/mml/mduerst/papers/PDF/
IUC11-UTF-8.pdf>. IUC11-UTF-8.pdf>.
[Gettys] Gettys, J., "URI Model Consequences", [Gettys] Gettys, J., "URI Model Consequences",
<http://www.w3.org/DesignIssues/ModelConsequences>. <http://www.w3.org/DesignIssues/ModelConsequences>.
[HTML4] Raggett, D., Le Hors, A. and I. Jacobs, "HTML 4.01 [HTML4] Raggett, D., Le Hors, A. and I. Jacobs, "HTML 4.01
Specification", World Wide Web Consortium Recommendation, Specification", World Wide Web Consortium Recommendation,
December 1999, December 1999,
<http://www.w3.org/TR/REC-html40/appendix/ <http://www.w3.org/TR/REC-html40/appendix/
notes.html#h-B.2>. notes.html#h-B.2>.
[RFC2045] Freed, N. and N. Borenstein, "Multipurpose Internet Mail
Extensions (MIME) Part One: Format of Internet Message
Bodies", RFC 2045, November 1996.
[RFC2130] Weider, C., Preston, C., Simonsen, K., Alvestrand, H., [RFC2130] Weider, C., Preston, C., Simonsen, K., Alvestrand, H.,
Atkinson, R., Crispin, M. and P. Svanberg, "The Report of Atkinson, R., Crispin, M. and P. Svanberg, "The Report of
the IAB Character Set Workshop held 29 February - 1 March, the IAB Character Set Workshop held 29 February - 1 March,
1996", RFC 2130, April 1997. 1996", RFC 2130, April 1997.
[RFC2141] Moats, R., "URN Syntax", RFC 2141, May 1997. [RFC2141] Moats, R., "URN Syntax", RFC 2141, May 1997.
[RFC2192] Newman, C., "IMAP URL Scheme", RFC 2192, September 1997. [RFC2192] Newman, C., "IMAP URL Scheme", RFC 2192, September 1997.
[RFC2277] Alvestrand, H., "IETF Policy on Character Sets and [RFC2277] Alvestrand, H., "IETF Policy on Character Sets and
skipping to change at page 38, line 11 skipping to change at page 42, line 25
Protocol", RFC 2640, July 1999. Protocol", RFC 2640, July 1999.
[RFC2718] Masinter, L., Alvestrand, H., Zigmond, D. and R. Petke, [RFC2718] Masinter, L., Alvestrand, H., Zigmond, D. and R. Petke,
"Guidelines for new URL Schemes", RFC 2718, November 1999. "Guidelines for new URL Schemes", RFC 2718, November 1999.
[UNIXML] Duerst, M. and A. Freytag, "Unicode in XML and other [UNIXML] Duerst, M. and A. Freytag, "Unicode in XML and other
Markup Languages", Unicode Technical Report #20, World Markup Languages", Unicode Technical Report #20, World
Wide Web Consortium Note, February 2002, Wide Web Consortium Note, February 2002,
<http://www.w3.org/TR/unicode-xml/>. <http://www.w3.org/TR/unicode-xml/>.
[W3CIRI] Duerst, M., "Internationalization - URIs and other
identifiers", September 2002,
<http://www.w3.org/International/O-URL-and-ident.html>.
[XLink] DeRose, S., Maler, E. and D. Orchard, "XML Linking [XLink] DeRose, S., Maler, E. and D. Orchard, "XML Linking
Language (XLink) Version 1.0", World Wide Web Consortium Language (XLink) Version 1.0", World Wide Web Consortium
Recommendation, June 2001, Recommendation, June 2001,
<http://www.w3.org/TR/xlink/#link-locators>. <http://www.w3.org/TR/xlink/#link-locators>.
[XML1] Bray, T., Paoli, J., Sperberg-McQueen, C., Maler, E. and [XML1] Bray, T., Paoli, J., Sperberg-McQueen, C., Maler, E. and
F. Yergeau, "Extensible Markup Language (XML) 1.0 (Third F. Yergeau, "Extensible Markup Language (XML) 1.0 (Third
Edition)", World Wide Web Consortium Recommendation, Edition)", World Wide Web Consortium Recommendation,
February 2004, February 2004,
<http://www.w3.org/TR/REC-xml#sec-external-ent>. <http://www.w3.org/TR/REC-xml#sec-external-ent>.
 End of changes. 

This html diff was produced by rfcdiff 1.16, available from http://www.levkowetz.com/ietf/tools/rfcdiff/