draft-duerst-iri-07.txt   draft-duerst-iri-08.txt 
Network Working Group M. Duerst Network Working Group M. Duerst
Internet-Draft W3C Internet-Draft W3C
Expires: November 7, 2004 M. Suignard Expires: November 26, 2004 M. Suignard
Microsoft Corporation Microsoft Corporation
May 9, 2004 May 28, 2004
Internationalized Resource Identifiers (IRIs) Internationalized Resource Identifiers (IRIs)
draft-duerst-iri-07 draft-duerst-iri-08
Status of this Memo Status of this Memo
By submitting this Internet-Draft, I certify that any applicable By submitting this Internet-Draft, I certify that any applicable
patent or other IPR claims of which I am aware have been disclosed, patent or other IPR claims of which I am aware have been disclosed,
and any of which I become aware will be disclosed, in accordance with and any of which I become aware will be disclosed, in accordance with
RFC 3668. RFC 3668.
Internet-Drafts are working documents of the Internet Engineering Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF), its areas, and its working groups. Note that other Task Force (IETF), its areas, and its working groups. Note that
groups may also distribute working documents as Internet-Drafts. other groups may also distribute working documents as
Internet-Drafts.
Internet-Drafts are draft documents valid for a maximum of six months Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress." material or to cite them other than as "work in progress."
The list of current Internet-Drafts can be accessed at http:// The list of current Internet-Drafts can be accessed at
www.ietf.org/ietf/1id-abstracts.txt. http://www.ietf.org/ietf/1id-abstracts.txt.
The list of Internet-Draft Shadow Directories can be accessed at The list of Internet-Draft Shadow Directories can be accessed at
http://www.ietf.org/shadow.html. http://www.ietf.org/shadow.html.
This Internet-Draft will expire on November 7, 2004. This Internet-Draft will expire on November 26, 2004.
Copyright Notice Copyright Notice
Copyright (C) The Internet Society (2004). All Rights Reserved. Copyright (C) The Internet Society (2004). All Rights Reserved.
Abstract Abstract
This document defines a new protocol element, the Internationalized This document defines a new protocol element, the Internationalized
Resource Identifier (IRI), as a complement to the Uniform Resource Resource Identifier (IRI), as a complement to the Uniform Resource
Identifier (URI). An IRI is a sequence of characters from the Identifier (URI). An IRI is a sequence of characters from the
skipping to change at page 2, line 9 skipping to change at page 2, line 10
URIs is defined, which means that IRIs can be used instead of URIs URIs is defined, which means that IRIs can be used instead of URIs
where appropriate to identify resources. where appropriate to identify resources.
The approach of defining a new protocol element was chosen, instead The approach of defining a new protocol element was chosen, instead
of extending or changing the definition of URIs, to allow a clear of extending or changing the definition of URIs, to allow a clear
distinction and to avoid incompatibilities with existing software. distinction and to avoid incompatibilities with existing software.
Guidelines for the use and deployment of IRIs in various protocols, Guidelines for the use and deployment of IRIs in various protocols,
formats, and software components that now deal with URIs are formats, and software components that now deal with URIs are
provided. provided.
Editorial Note
This document is a product of the Internationalization Working Group
(I18N WG) of the World Wide Web Consortium (W3C). For general
discussion, please use the public-iri@w3.org mailing list (publicly
archived at http://lists.w3.org/Archives/Public/public-iri/). An
issues list for this document is maintained at http://www.w3.org/
International/iri-edit#issues. For more information on the topic of
this document, please also see [W3CIRI] and [Duerst01].
Table of Contents Table of Contents
1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 4 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.1 Overview and Motivation . . . . . . . . . . . . . . . . . 4 1.1 Overview and Motivation . . . . . . . . . . . . . . . . . 4
1.2 Applicability . . . . . . . . . . . . . . . . . . . . . . 4 1.2 Applicability . . . . . . . . . . . . . . . . . . . . . . 5
1.3 Definitions . . . . . . . . . . . . . . . . . . . . . . . 5 1.3 Definitions . . . . . . . . . . . . . . . . . . . . . . . 5
1.4 Notation . . . . . . . . . . . . . . . . . . . . . . . . . 6 1.4 Notation . . . . . . . . . . . . . . . . . . . . . . . . . 6
2. IRI Syntax . . . . . . . . . . . . . . . . . . . . . . . . . . 7 2. IRI Syntax . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.1 Summary of IRI Syntax . . . . . . . . . . . . . . . . . . 7 2.1 Summary of IRI Syntax . . . . . . . . . . . . . . . . . . 7
2.2 ABNF for IRI References and IRIs . . . . . . . . . . . . . 8 2.2 ABNF for IRI References and IRIs . . . . . . . . . . . . . 8
3. Relationship between IRIs and URIs . . . . . . . . . . . . . . 10 3. Relationship between IRIs and URIs . . . . . . . . . . . . . . 11
3.1 Mapping of IRIs to URIs . . . . . . . . . . . . . . . . . 10 3.1 Mapping of IRIs to URIs . . . . . . . . . . . . . . . . . 11
3.2 Converting URIs to IRIs . . . . . . . . . . . . . . . . . 13 3.2 Converting URIs to IRIs . . . . . . . . . . . . . . . . . 14
3.2.1 Examples . . . . . . . . . . . . . . . . . . . . . . . 15 3.2.1 Examples . . . . . . . . . . . . . . . . . . . . . . . 16
4. Bidirectional IRIs for Right-to-left Languages . . . . . . . . 16 4. Bidirectional IRIs for Right-to-left Languages . . . . . . . . 17
4.1 Logical Storage and Visual Presentation . . . . . . . . . 17 4.1 Logical Storage and Visual Presentation . . . . . . . . . 17
4.2 Bidi IRI Structure . . . . . . . . . . . . . . . . . . . . 18 4.2 Bidi IRI Structure . . . . . . . . . . . . . . . . . . . . 19
4.3 Input of Bidi IRIs . . . . . . . . . . . . . . . . . . . . 19 4.3 Input of Bidi IRIs . . . . . . . . . . . . . . . . . . . . 20
4.4 Examples . . . . . . . . . . . . . . . . . . . . . . . . . 19 4.4 Examples . . . . . . . . . . . . . . . . . . . . . . . . . 20
5. IRI Equivalence and Comparison . . . . . . . . . . . . . . . . 21 5. IRI Equivalence and Comparison . . . . . . . . . . . . . . . . 22
5.1 Simple String Comparison . . . . . . . . . . . . . . . . . 21 5.1 Simple String Comparison . . . . . . . . . . . . . . . . . 22
5.2 Conversion to URIs . . . . . . . . . . . . . . . . . . . . 22 5.2 Conversion to URIs . . . . . . . . . . . . . . . . . . . . 23
5.3 Normalization . . . . . . . . . . . . . . . . . . . . . . 22 5.3 Normalization . . . . . . . . . . . . . . . . . . . . . . 23
5.4 Preferred Forms . . . . . . . . . . . . . . . . . . . . . 23 5.4 Preferred Forms . . . . . . . . . . . . . . . . . . . . . 24
6. Use of IRIs . . . . . . . . . . . . . . . . . . . . . . . . . 24 6. Use of IRIs . . . . . . . . . . . . . . . . . . . . . . . . . 25
6.1 Limitations on UCS Characters Allowed in IRIs . . . . . . 24 6.1 Limitations on UCS Characters Allowed in IRIs . . . . . . 25
6.2 Software Interfaces and Protocols . . . . . . . . . . . . 24 6.2 Software Interfaces and Protocols . . . . . . . . . . . . 25
6.3 Format of URIs and IRIs in Documents and Protocols . . . . 25 6.3 Format of URIs and IRIs in Documents and Protocols . . . . 26
6.4 Use of UTF-8 for Encoding Original Characters . . . . . . 25 6.4 Use of UTF-8 for Encoding Original Characters . . . . . . 26
6.5 Relative IRI References . . . . . . . . . . . . . . . . . 26 6.5 Relative IRI References . . . . . . . . . . . . . . . . . 27
7. URI/IRI Processing Guidelines (informative) . . . . . . . . . 26 7. URI/IRI Processing Guidelines (informative) . . . . . . . . . 27
7.1 URI/IRI Software Interfaces . . . . . . . . . . . . . . . 26 7.1 URI/IRI Software Interfaces . . . . . . . . . . . . . . . 27
7.2 URI/IRI Entry . . . . . . . . . . . . . . . . . . . . . . 27 7.2 URI/IRI Entry . . . . . . . . . . . . . . . . . . . . . . 28
7.3 URI/IRI Transfer Between Applications . . . . . . . . . . 28 7.3 URI/IRI Transfer Between Applications . . . . . . . . . . 29
7.4 URI/IRI Generation . . . . . . . . . . . . . . . . . . . . 28 7.4 URI/IRI Generation . . . . . . . . . . . . . . . . . . . . 29
7.5 URI/IRI Selection . . . . . . . . . . . . . . . . . . . . 29 7.5 URI/IRI Selection . . . . . . . . . . . . . . . . . . . . 30
7.6 Display of URIs/IRIs . . . . . . . . . . . . . . . . . . . 29 7.6 Display of URIs/IRIs . . . . . . . . . . . . . . . . . . . 30
7.7 Interpretation of URIs and IRIs . . . . . . . . . . . . . 30 7.7 Interpretation of URIs and IRIs . . . . . . . . . . . . . 31
7.8 Upgrading Strategy . . . . . . . . . . . . . . . . . . . . 30 7.8 Upgrading Strategy . . . . . . . . . . . . . . . . . . . . 31
8. Security Considerations . . . . . . . . . . . . . . . . . . . 31 8. Security Considerations . . . . . . . . . . . . . . . . . . . 32
9. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . 33 9. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 34
10. References . . . . . . . . . . . . . . . . . . . . . . . . . 33 10. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . 34
10.1 Normative References . . . . . . . . . . . . . . . . . . . . 33 11. References . . . . . . . . . . . . . . . . . . . . . . . . . 34
10.2 Non-normative References . . . . . . . . . . . . . . . . . . 34 11.1 Normative References . . . . . . . . . . . . . . . . . . . . 34
Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . 36 11.2 Non-normative References . . . . . . . . . . . . . . . . . . 35
A. Design Alternatives . . . . . . . . . . . . . . . . . . . . . 37 Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . 38
A.1 New Scheme(s) . . . . . . . . . . . . . . . . . . . . . . 37 A. Design Alternatives . . . . . . . . . . . . . . . . . . . . . 38
A.2 Other Character Encodings than UTF-8 . . . . . . . . . . . 37 A.1 New Scheme(s) . . . . . . . . . . . . . . . . . . . . . . 38
A.3 New Encoding Convention . . . . . . . . . . . . . . . . . 38 A.2 Other Character Encodings than UTF-8 . . . . . . . . . . . 39
A.4 Indicating Character Encodings in the URI/IRI . . . . . . 38 A.3 New Encoding Convention . . . . . . . . . . . . . . . . . 39
Intellectual Property and Copyright Statements . . . . . . . . 39 A.4 Indicating Character Encodings in the URI/IRI . . . . . . 39
Intellectual Property and Copyright Statements . . . . . . . . 40
1. Introduction 1. Introduction
1.1 Overview and Motivation 1.1 Overview and Motivation
A URI is defined in [RFCYYYY] as a sequence of characters chosen from A Uniform Resource Identifier (URI) is defined in [RFCYYYY] as a
a limited subset of the repertoire of US-ASCII characters. sequence of characters chosen from a limited subset of the repertoire
of US-ASCII [ASCII] characters.
The characters in URIs are frequently used for representing words of The characters in URIs are frequently used for representing words of
natural languages. Such usage has many advantages: such URIs are natural languages. Such usage has many advantages: such URIs are
easier to memorize, easier to interpret, easier to transcribe, easier easier to memorize, easier to interpret, easier to transcribe, easier
to create, and easier to guess. For most languages other than to create, and easier to guess. For most languages other than
English, however, the natural script uses characters other than A-Z. English, however, the natural script uses characters other than A-Z.
For many people, handling Latin characters is as difficult as For many people, handling Latin characters is as difficult as
handling the characters of other scripts is for people who use only handling the characters of other scripts is for people who use only
the Latin alphabet. Many languages with non-Latin scripts have the Latin alphabet. Many languages with non-Latin scripts have
transcriptions to Latin letters. Such transcriptions are now often transcriptions to Latin letters. Such transcriptions are now often
used in URIs, but they introduce additional ambiguities. used in URIs, but they introduce additional ambiguities.
The infrastructure for the appropriate handling of characters from The infrastructure for the appropriate handling of characters from
local scripts is now widely deployed in local versions of operating local scripts is now widely deployed in local versions of operating
system and application software. Software that can handle a wide system and application software. Software that can handle a wide
variety of scripts and languages at the same time is increasingly variety of scripts and languages at the same time is increasingly
widespread. Also, there are increasing numbers of protocols and widespread. Also, there are increasing numbers of protocols and
formats that can carry a wide range of characters. formats that can carry a wide range of characters.
This document defines a new protocol element, called IRI This document defines a new protocol element, called
(Internationalized Resource Identifier), by extending the syntax of Internationalized Resource Identifier (IRI), by extending the syntax
URIs to a much wider repertoire of characters. It also defines of URIs to a much wider repertoire of characters. It also defines
"internationalized" versions corresponding to other constructs from "internationalized" versions corresponding to other constructs from
[RFCYYYY], such as URI references. [RFCYYYY], such as URI references. The syntax of IRIs is defined in
Section 2, and the relationship between IRIs and URIs in Section 3.
Using characters outside of A-Z in IRIs brings with it some Using characters outside of A-Z in IRIs brings with it some
difficulties; a discussion of potential problems and workarounds can difficulties. Section 4 discusses the special case of bidirectional
be found in the later sections of this document. IRIs, Section 5 various forms of equivalence between IRIs, and
Section 6 the use of IRIs in different situations. Section 7 gives
additional informative guidelines, and Section 8 security
considerations.
For discussion of this document, please use the public-iri@w3.org
mailing list (publicly archived at
http://lists.w3.org/Archives/Public/public-iri/). An issues list for
this document is maintained at
http://www.w3.org/International/iri-edit#issues. For more
information on the topic of this document, please also see [W3CIRI]
and [Duerst01].
1.2 Applicability 1.2 Applicability
IRIs are designed to be compatible with recent recommendations for IRIs are designed to be compatible with recent recommendations for
new URI schemes [RFC2718]. The compatibility is provided by new URI schemes [RFC2718]. The compatibility is provided by
specifying a well-defined and deterministic mapping from the IRI specifying a well-defined and deterministic mapping from the IRI
character sequence to the functionally equivalent URI character character sequence to the functionally equivalent URI character
sequence. Practical use of IRIs (or IRI references) in place of URIs sequence. Practical use of IRIs (or IRI references) in place of URIs
(or URI references) depends on the following conditions being met: (or URI references) depends on the following conditions being met:
a) The protocol or format element used should be explicitly a) The protocol or format element where IRIs are used should be
designated to carry IRIs. That is, the intent is not to introduce explicitly designated to be able to carry IRIs. That is, the
IRIs into contexts that are not defined to accept them. For intent is not to introduce IRIs into contexts that are not defined
example, XML schema [XMLSchema] has an explicit type "anyURI" that to accept them. For example, XML schema [XMLSchema] has an
designates the use of IRIs. explicit type "anyURI" that includes IRIs and IRI references.
Therefore, IRIs and IRI references can be in attributes and
elements of type "anyURI". On the other hand, in the HTTP
protocol [RFC2616], the Request URI is defined as a URI, which
means that direct use of IRIs is not allowed in HTTP requests.
b) The protocol or format carrying the IRIs should have a mechanism b) The protocol or format carrying the IRIs should have a mechanism
to represent the wide range of characters used in IRIs, either to represent the wide range of characters used in IRIs, either
natively or by some protocol- or format-specific escaping natively or by some protocol- or format-specific escaping
mechanism (for example numeric character references in [XML1]). mechanism (for example numeric character references in [XML1]).
c) The URI corresponding to the IRI in question has to encode c) The URI corresponding to the IRI in question has to encode
original characters into octets using UTF-8. For new URI schemes, original characters into octets using UTF-8. For new URI schemes,
this is recommended in [RFC2718]. It can apply to a whole scheme this is recommended in [RFC2718]. It can apply to a whole scheme
(e.g. IMAP URLs [RFC2192] and POP URLs [RFC2384], or the URN (e.g. IMAP URLs [RFC2192] and POP URLs [RFC2384], or the URN
skipping to change at page 5, line 49 skipping to change at page 6, line 20
(unambiguously) converting a sequence of octets into a sequence of (unambiguously) converting a sequence of octets into a sequence of
characters. characters.
charset: The name of a parameter or attribute used to identify a charset: The name of a parameter or attribute used to identify a
character encoding. character encoding.
UCS: Universal Character Set; the coded character set defined by ISO/ UCS: Universal Character Set; the coded character set defined by ISO/
IEC 10646 [ISO10646] and the Unicode Standard [UNIV4]. IEC 10646 [ISO10646] and the Unicode Standard [UNIV4].
IRI reference: The term "IRI reference" denotes the common usage of IRI reference: The term "IRI reference" denotes the common usage of
an internationalized resource identifier. An IRI reference may be an Internationalized Resource Identifier. An IRI reference may be
absolute or relative. However, the "IRI" that results from such a absolute or relative. However, the "IRI" that results from such a
reference only includes absolute IRIs; any relative IRIs are reference only includes absolute IRIs; any relative IRIs are
resolved to their absolute form. Note that in [RFC2396], URIs did resolved to their absolute form. Note that in [RFC2396], URIs did
not include fragment identifiers, but in [RFCYYYY], fragment not include fragment identifiers, but in [RFCYYYY], fragment
identifiers are part of URIs. identifiers are part of URIs.
running text: Human text (paragraphs, sentences, phrases) with syntax running text: Human text (paragraphs, sentences, phrases) with syntax
according to orthographic conventions of a natural language, as according to orthographic conventions of a natural language, as
opposed to syntax defined for ease of processing by machines opposed to syntax defined for ease of processing by machines
(markup, programming languages, etc.). (markup, programming languages, etc.).
protocol element: Any portion of a message which affects processing protocol element: Any portion of a message which affects processing
of that message by the protocol in question. of that message by the protocol in question.
presentation element: Presentation form corresponding to a protocol presentation element: Presentation form corresponding to a protocol
element, for example using a wider range of characters. element, for example using a wider range of characters.
create (a URI or an IRI): With respect to URIs and IRIs, the word create (a URI or an IRI): With respect to URIs and IRIs, the word
'create' is used for the initial creation. This may be the initial 'create' is used for the initial creation. This may be the
creation of a resource with a certain name, or the initial initial creation of a resource with a certain name, or the initial
exposition of a resource under a particular name. exposition of a resource under a particular name.
generate (a URI or an IRI): With respect to URIs and IRIs, the word generate (a URI or an IRI): With respect to URIs and IRIs, the word
'generate' is used when the IRI is generated by derivation from 'generate' is used when the IRI is generated by derivation from
other information. other information.
1.4 Notation 1.4 Notation
RFCs and Internet Drafts currently do not allow any characters RFCs and Internet Drafts currently do not allow any characters
outside the US-ASCII repertoire. Therefore, this document uses outside the US-ASCII repertoire. Therefore, this document uses
skipping to change at page 6, line 46 skipping to change at page 7, line 17
using a prefix of 'U+', followed by four to six hexadecimal digits. using a prefix of 'U+', followed by four to six hexadecimal digits.
To represent characters outside US-ASCII in examples, this document To represent characters outside US-ASCII in examples, this document
uses two notations called 'XML Notation' and 'Bidi Notation'. uses two notations called 'XML Notation' and 'Bidi Notation'.
XML Notation uses leading '&#x', trailing ';', and the hexadecimal XML Notation uses leading '&#x', trailing ';', and the hexadecimal
number of the character in the UCS in between. Example: &#x44F; number of the character in the UCS in between. Example: &#x44F;
stands for CYRILLIC SMALL LETTER YA. In this notation, an actual stands for CYRILLIC SMALL LETTER YA. In this notation, an actual
'&' is denoted by '&amp;'. '&' is denoted by '&amp;'.
Bidi Notation is used for bidirectional examples: lower case ASCII Bidi Notation is used for bidirectional examples: lower case letters
letters stand for Latin letters or other letters that are written stand for Latin letters or other letters that are written
left-to-right, whereas upper case letters represent Arabic or Hebrew left-to-right, whereas upper case letters represent Arabic or Hebrew
letters that are written right-to-left. letters that are written right-to-left.
To denote actual octets in examples (as opposed to percent-encoded To denote actual octets in examples (as opposed to percent-encoded
octets), the two hex digits denoting the octet are enclosed in "<" octets), the two hex digits denoting the octet are enclosed in "<"
and ">". For example, the octet often denoted as 0xc9 is denoted here and ">". For example, the octet often denoted as 0xc9 is denoted
as <c9>. here as <c9>.
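The XML Notation and the octet notation described above can be illustrated with a short Python sketch. It is not part of either draft; the function names from_xml_notation and octet_notation are hypothetical and chosen only for this illustration.

```python
import re

def from_xml_notation(text):
    """Expand '&#xHHHH;' references and '&amp;' into actual characters.

    A single regular-expression pass is used so that the '&' produced
    from '&amp;' is never re-read as the start of another reference.
    """
    def expand(match):
        hex_digits = match.group(1)
        if hex_digits is not None:
            # '&#x' + hexadecimal UCS number + ';'
            return chr(int(hex_digits, 16))
        # Otherwise the match was the literal '&amp;'.
        return "&"
    return re.sub(r"&#x([0-9A-Fa-f]+);|&amp;", expand, text)

def octet_notation(text):
    """Show the UTF-8 octets of text, each written as <hh>."""
    return "".join(f"<{octet:02x}>" for octet in text.encode("utf-8"))

# U+044F written in XML Notation, then shown as UTF-8 octets.
ya = from_xml_notation("&#x44F;")
print(octet_notation(ya))
```

The sketch assumes that the octets of interest are the UTF-8 octets of the character; the notation itself can denote octets in any character encoding.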
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
document are to be interpreted as described in [RFC2119]. document are to be interpreted as described in [RFC2119].
2. IRI Syntax 2. IRI Syntax
This section defines the syntax of Internationalized Resource This section defines the syntax of Internationalized Resource
Identifiers (IRIs). Identifiers (IRIs).
skipping to change at page 7, line 43 skipping to change at page 8, line 12
unreserved characters is extended by adding the characters of the UCS unreserved characters is extended by adding the characters of the UCS
(Universal Character Set, [ISO10646]) beyond U+007F, subject to the (Universal Character Set, [ISO10646]) beyond U+007F, subject to the
limitations given in the syntax rules below and in Section 6.1. limitations given in the syntax rules below and in Section 6.1.
Otherwise, the syntax and use of components and reserved characters Otherwise, the syntax and use of components and reserved characters
is the same as that in [RFCYYYY]. All the operations defined in is the same as that in [RFCYYYY]. All the operations defined in
[RFCYYYY], such as the resolution of relative URIs, can be applied to [RFCYYYY], such as the resolution of relative URIs, can be applied to
IRIs by IRI-processing software in exactly the same way as they are IRIs by IRI-processing software in exactly the same way as they are
applied to URIs by URI-processing software. applied to URIs by URI-processing software.
Characters outside the US-ASCII range are not reserved and therefore Characters outside the US-ASCII repertoire are not reserved and
MUST NOT be used for syntactical purposes such as to delimit therefore MUST NOT be used for syntactical purposes such as to
components in newly defined schemes. As an example, it is not allowed delimit components in newly defined schemes. As an example, it is
to use U+00A2, CENT SIGN, as a delimiter in IRIs, because it is in not allowed to use U+00A2, CENT SIGN, as a delimiter in IRIs, because
the 'iunreserved' category, in the same way as it is not possible to it is in the 'iunreserved' category, in the same way as it is not
use '-' as a delimiter, because it is in the 'unreserved' category in possible to use '-' as a delimiter, because it is in the 'unreserved'
URIs. category in URIs.
2.2 ABNF for IRI References and IRIs 2.2 ABNF for IRI References and IRIs
While it might be possible to define IRI references and IRIs merely While it might be possible to define IRI references and IRIs merely
by their transformation to URI references and URIs, they can also be by their transformation to URI references and URIs, they can also be
accepted and processed directly. Therefore, an ABNF definition for accepted and processed directly. Therefore, an ABNF definition for
IRI references (which are the most general concept and the start of IRI references (which are the most general concept and the start of
the grammar) and IRIs is given here. The syntax of this ABNF is the grammar) and IRIs is given here. The syntax of this ABNF is
described in [RFC2234]. Character numbers are taken from the UCS, described in [RFC2234]. Character numbers are taken from the UCS,
without implying any actual binary encoding. Terminals in the ABNF without implying any actual binary encoding. Terminals in the ABNF
are characters, not bytes. are characters, not bytes.
The following grammar closely follows the URI grammar in [RFCYYYY],
except that the range of unreserved characters is expanded to include
UCS characters, with the restriction that private UCS characters can
occur only in query parts and not elsewhere. The grammar is split
into two parts, rules that differ from [RFCYYYY] because of the
above-mentioned expansion, and rules that are the same as in
[RFCYYYY]. For rules that differ from those in [RFCYYYY], the names
of the non-terminals have been changed as follows: If the
non-terminal contains 'URI', this has been changed to 'IRI'.
Otherwise, an 'i' has been prefixed.
The following rules are different from [RFCYYYY]: The following rules are different from [RFCYYYY]:
IRI = scheme ":" ihier-part [ "?" iquery ] IRI = scheme ":" ihier-part [ "?" iquery ]
[ "#" ifragment ] [ "#" ifragment ]
ihier-part = "//" iauthority ipath-abempty ihier-part = "//" iauthority ipath-abempty
/ ipath-abs / ipath-abs
/ ipath-rootless / ipath-rootless
/ ipath-empty / ipath-empty
IRI-reference = IRI / relative-IRI IRI-reference = IRI / relative-IRI
absolute-IRI = scheme ":" ihier-part [ "?" iquery ] absolute-IRI = scheme ":" ihier-part [ "?" iquery ]
relative-IRI = irelative-part [ "?" iquery ] [ "#" ifragment ] relative-IRI = irelative-part [ "?" iquery ] [ "#" ifragment ]
irelative-part = "//" iauthority ipath-abempty irelative-part = "//" iauthority ipath-abempty
/ ipath-abs / ipath-abs
/ ipath-noscheme / ipath-noscheme
/ ipath-empty / ipath-empty
skipping to change at page 11, line 11 skipping to change at page 11, line 39
Scheme-specific restrictions are applied to IRIs by converting Scheme-specific restrictions are applied to IRIs by converting
IRIs to URIs and checking the URIs against the scheme-specific IRIs to URIs and checking the URIs against the scheme-specific
restrictions. restrictions.
b) Interpretational: URIs identify resources in various ways. IRIs b) Interpretational: URIs identify resources in various ways. IRIs
also identify resources. When the IRI is used solely for also identify resources. When the IRI is used solely for
identification purposes, it is not necessary to map the IRI to a identification purposes, it is not necessary to map the IRI to a
URI (see Section 5). However, when an IRI is used for resource URI (see Section 5). However, when an IRI is used for resource
retrieval, the resource that the IRI locates is the same as the retrieval, the resource that the IRI locates is the same as the
one located by the URI obtained after converting the IRI according one located by the URI obtained after converting the IRI according
to the procedure defined here. This means that there is no need to to the procedure defined here. This means that there is no need
define resolution separately on the IRI level. to define resolution separately on the IRI level.
Applications MUST map IRIs to URIs using the following two steps. Applications MUST map IRIs to URIs using the following two steps.
Step 1) This step generates a UCS-based character encoding from the Step 1) This step generates a UCS character sequence from the
original IRI format. This step has three variants, depending on original IRI format. This step has three variants, depending on
the form of the input. the form of the input.
Variant A) If the IRI is written on paper or read out loud, or Variant A) If the IRI is written on paper or read out loud, or
otherwise represented as a sequence of characters independent otherwise represented as a sequence of characters independent
of any character encoding: Represent the IRI as a sequence of of any character encoding: Represent the IRI as a sequence of
characters from the UCS normalized according to Normalization characters from the UCS normalized according to Normalization
Form C (NFC, [UTR15]). Form C (NFC, [UTR15]).
Variant B) If the IRI is in some digital representation (e.g. an Variant B) If the IRI is in some digital representation (e.g. an
octet stream) in some known non-Unicode character encoding: octet stream) in some known non-Unicode character encoding:
Convert the IRI to a sequence of characters from the UCS Convert the IRI to a sequence of characters from the UCS
normalized according to NFC. normalized according to NFC.
Variant C) If the IRI is in a Unicode-based character encoding Variant C) If the IRI is in a Unicode-based character encoding
(for example UTF-8 or UTF-16): Do not normalize. Move directly (for example UTF-8 or UTF-16): Do not normalize. Apply Step 2
to Step 2. directly to the encoded Unicode character sequence.
Step 2) For each character that is disallowed in URI references, Step 2) For each character in 'ucschar' or 'iprivate', apply Steps
apply Steps 2.1 through 2.3 below. The disallowed characters 2.1 through 2.3 below.
consist of all non-ASCII characters allowed in IRIs.
2.1) Convert the character to a sequence of one or more octets 2.1) Convert the character to a sequence of one or more octets
using UTF-8 [RFC3629]. using UTF-8 [RFC3629].
2.2) Convert each octet to %HH, where HH is the hexadecimal 2.2) Convert each octet to %HH, where HH is the hexadecimal
notation of the octet value. Note: This is identical to the notation of the octet value. Note that this is identical to
percent-encoding mechanism in Section 2.1 of [RFCYYYY]. To the percent-encoding mechanism in Section 2.1 of [RFCYYYY]. To
reduce variability, the hexadecimal notation SHOULD use upper reduce variability, the hexadecimal notation SHOULD use upper
case letters. case letters.
2.3) Replace the original character by the resulting character 2.3) Replace the original character by the resulting character
sequence (i.e. a sequence of %HH triplets). sequence (i.e. a sequence of %HH triplets).
The above mapping from IRIs to URIs produces URIs fully conforming to The above mapping from IRIs to URIs produces URIs fully conforming to
[RFCYYYY]. The mapping is also an identity transformation for URIs [RFCYYYY]. The mapping is also an identity transformation for URIs
and is idempotent -- applying the mapping a second time will not and is idempotent -- applying the mapping a second time will not
change anything. Every URI is by definition an IRI. change anything. Every URI is by definition an IRI.
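As an illustration only, Steps 2.1 through 2.3 can be sketched in Python as follows. The function name iri_to_uri is hypothetical. The sketch assumes that Step 1 has already been applied, and it treats every non-ASCII character as a member of 'ucschar' or 'iprivate' without validating it against the IRI syntax.

```python
def iri_to_uri(iri: str) -> str:
    """Percent-encode each non-ASCII character of an IRI (Steps 2.1 to 2.3).

    Assumption: the input is already a sequence of UCS characters as
    produced by Step 1; no syntax validation is attempted here.
    """
    out = []
    for ch in iri:
        if ord(ch) < 0x80:
            # ASCII characters, including existing %HH triplets, are kept.
            out.append(ch)
        else:
            # Step 2.1: UTF-8 octets; Step 2.2: %HH with upper case hex.
            out.extend("%{:02X}".format(octet) for octet in ch.encode("utf-8"))
    return "".join(out)


uri = iri_to_uri("http://r\u00E9sum\u00E9.example.org")
# Applying the mapping a second time leaves the result unchanged.
assert iri_to_uri(uri) == uri
```

Because ASCII characters are copied unchanged, a URI passed to this sketch comes back as-is, which matches the identity and idempotence properties stated above.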
Infrastructure accepting IRIs MAY convert the ireg-name component of Infrastructure accepting IRIs MAY convert the ireg-name component of
an IRI as follows (before Step 2.2 above) for schemes that are known an IRI as follows (before Step 2 above) for schemes that are known to
to use domain names in ireg-name, but where the scheme definition use domain names in ireg-name, but where the scheme definition does
does not allow percent-encoding for ireg-name: Replace the ireg-name not allow percent-encoding for ireg-name: Replace the ireg-name part
part of the IRI by the part converted using the ToASCII operation of the IRI by the part converted using the ToASCII operation
specified in Section 4.1 of [RFC3490] on each dot-separated label, specified in Section 4.1 of [RFC3490] on each dot-separated label,
and using U+002E (FULL STOP) as a label separator, with the flag and using U+002E (FULL STOP) as a label separator, with the flag
UseSTD3ASCIIRules set to TRUE and the flag AllowUnassigned set to UseSTD3ASCIIRules set to TRUE and the flag AllowUnassigned set to
FALSE for creating IRIs and set to TRUE otherwise. The ToASCII FALSE for creating IRIs and set to TRUE otherwise. The ToASCII
operation may fail, but this would mean that the IRI cannot be operation may fail, but this would mean that the IRI cannot be
resolved. This conversion SHOULD be used when the goal is to maximize resolved. This conversion SHOULD be used when the goal is to
interoperability with legacy URI resolvers. For example, the IRI maximize interoperability with legacy URI resolvers. For example,
the IRI
http://r&#xE9;sum&#xE9;.example.org may be converted to http://r&#xE9;sum&#xE9;.example.org may be converted to
http://xn--rsum-bpad.example.org instead of http://xn--rsum-bpad.example.org instead of
http://r%C3%A9sum%C3%A9.example.org. http://r%C3%A9sum%C3%A9.example.org.
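The ireg-name conversion can be approximated with the "idna" codec from the Python standard library, which implements the ToASCII operation of [RFC3490] label by label. This is a sketch under assumptions: the function name ireg_name_to_ascii is hypothetical, the ireg-name is assumed to have been extracted from the IRI already, and the codec does not expose the UseSTD3ASCIIRules and AllowUnassigned flags, so it is not a full implementation of the parameters described above.

```python
def ireg_name_to_ascii(ireg_name: str) -> str:
    """Apply ToASCII to each dot-separated label of an ireg-name.

    Assumption: the caller has already isolated the ireg-name component.
    A UnicodeError is raised if the operation fails for some label, in
    which case the IRI cannot be resolved.
    """
    return ireg_name.encode("idna").decode("ascii")


host = ireg_name_to_ascii("r\u00E9sum\u00E9.example.org")
print("http://" + host)  # prints http://xn--rsum-bpad.example.org
```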
An IRI with a scheme that is known to use domain names in ireg-name, An IRI with a scheme that is known to use domain names in ireg-name,
but where the scheme definition does not allow percent-encoding for but where the scheme definition does not allow percent-encoding for
ireg-name, meets scheme-specific restrictions if either the ireg-name, meets scheme-specific restrictions if either the
straightforward conversion or the conversion using the ToASCII straightforward conversion or the conversion using the ToASCII
operation on ireg-name results in a URI that meets the operation on ireg-name results in a URI that meets the
scheme-specific restrictions. An IRI with a scheme that is known to scheme-specific restrictions. An IRI with a scheme that is known to
use domain names in ireg-name, but where the scheme definition does use domain names in ireg-name, but where the scheme definition does
not allow percent-encoding for ireg-name, resolves to the URI not allow percent-encoding for ireg-name, resolves to the URI
obtained after converting the IRI including using the ToASCII obtained after converting the IRI including using the ToASCII
operation on ireg-name. Implementations do not need to do this operation on ireg-name. Implementations do not need to do this
conversion as long as they produce the same result. conversion as long as they produce the same result.
Note: The uniform treatment of the whole IRI in Step 2.2 above is Note: The difference between Variants B and C in Step 1 (Variant B
using normalization with NFC while Variant C not using any
normalization) is to account for the fact that in many non-Unicode
character encodings, some text cannot be represented directly.
For example, Vietnam is natively written "Vi&#x1EC7;t Nam"
(containing a LATIN SMALL LETTER E WITH CIRCUMFLEX AND DOT BELOW)
in NFC, but a direct transcoding from the windows-1258 character
encoding leads to "Vi&#xEA;&#x323;t Nam" (containing a LATIN SMALL
LETTER E WITH CIRCUMFLEX followed by a COMBINING DOT BELOW),
whereas direct transcoding of other 8-bit encodings of Vietnamese
may lead to other representations.
Note: The uniform treatment of the whole IRI in Step 2 above is
important to not make processing dependent on URI scheme. See important to not make processing dependent on URI scheme. See
[Gettys] for an in-depth discussion. [Gettys] for an in-depth discussion.
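The effect of the NFC normalization required by Variants A and B of Step 1 can be seen with the Vietnamese example from the note on Variants B and C. The following is a small sketch using the Python standard library; the helper name to_nfc is hypothetical, and the decomposed string is written out directly rather than obtained by transcoding from windows-1258.

```python
import unicodedata


def to_nfc(text: str) -> str:
    """Normalize a character sequence to Normalization Form C."""
    return unicodedata.normalize("NFC", text)


# E WITH CIRCUMFLEX (U+00EA) followed by COMBINING DOT BELOW (U+0323),
# as a direct transcoding might produce it.
decomposed = "Vi\u00EA\u0323t Nam"
composed = to_nfc(decomposed)

# After NFC, the two characters are replaced by the single character
# U+1EC7, LATIN SMALL LETTER E WITH CIRCUMFLEX AND DOT BELOW.
assert composed == "Vi\u1EC7t Nam"
```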
Note: In practice, the difference above will not be noticed if Note: In practice, the difference above will not be noticed if
mapping from IRI to URI and resolution is tightly integrated (e.g. mapping from IRI to URI and resolution is tightly integrated (e.g.
carried out in the same user agent). But conversion using carried out in the same user agent). But conversion using
[RFC3490] may be able to better deal with backwards compatibility [RFC3490] may be able to better deal with backwards compatibility
issues in case mapping and resolution are separated, as in the issues in case mapping and resolution are separated, as in the
case of using an HTTP proxy. case of using an HTTP proxy.
Note: Internationalized Domain Names may be contained in parts of an Note: Internationalized Domain Names may be contained in parts of an
IRI other than the ireg-name part. It is the responsibility of IRI other than the ireg-name part. It is the responsibility of
scheme-specific implementations (if the Internationalized Domain scheme-specific implementations (if the Internationalized Domain
Name is part of the scheme syntax) or of server-side Name is part of the scheme syntax) or of server-side
implementations (if the Internationalized Domain Name is part of implementations (if the Internationalized Domain Name is part of
'iquery') to apply the necessary conversions at the appropriate 'iquery') to apply the necessary conversions at the appropriate
point. Example: Trying to validate the Web page at point. Example: Trying to validate the Web page at
http://r&#xE9;sum&#xE9;.example.org would lead to an IRI of http://r&#xE9;sum&#xE9;.example.org would lead to an IRI of
http://validator.w3.org/ http://validator.w3.org/check?uri=http%3A%2F%2Fr&#xE9;sum&#xE9;.example.org,
check?uri=http%3A%2F%2Fr&#xE9;sum&#xE9;.example.org, which would which would convert to a URI of
convert to a URI of http://validator.w3.org/check?uri=http%3A%2F%2Fr%C3%A9sum%C3%A9.example.org.
http://validator.w3.org/ The server side implementation would be responsible to do the
check?uri=http%3A%2F%2Fr%C3%A9sum%C3%A9.example.org. The server necessary conversions in order to be able to retrieve the Web
side implementation would be responsible to do the necessary page.
conversions in order to be able to retrieve the Web page.
Infrastructure accepting IRIs MAY also deal with the printable Infrastructure accepting IRIs MAY also deal with the printable
characters in US-ASCII that are not allowed in URIs, namely "<", ">", characters in US-ASCII that are not allowed in URIs, namely "<", ">",
'"', Space, "{", "}", "|", "\", "^", and "`", in Step 2.2 above. If '"', Space, "{", "}", "|", "\", "^", and "`", in Step 2 above. If
such characters are found but are not converted, then the conversion such characters are found but are not converted, then the conversion
SHOULD fail. Please note that the number sign ("#"), the percent sign SHOULD fail. Please note that the number sign ("#"), the percent
("%"), and the square bracket characters ("[", "]") are not part of sign ("%"), and the square bracket characters ("[", "]") are not part
the above list, and MUST NOT be converted. Protocols and formats that of the above list, and MUST NOT be converted. Protocols and formats
have used earlier definitions of IRIs including these characters MAY that have used earlier definitions of IRIs including these characters
require percent-encoding of these characters as a preprocessing step MAY require percent-encoding of these characters as a preprocessing
to extract the actual IRI from a given field. Such preprocessing MAY step to extract the actual IRI from a given field. Such
also be used by applications allowing the user to enter an IRI. preprocessing MAY also be used by applications allowing the user to
enter an IRI.
Note: In this process (in Step 2.3), characters allowed in URI Note: In this process (in Step 2.3), characters allowed in URI
references as well as existing percent-encoded sequences are not references as well as existing percent-encoded sequences are not
encoded further. (This mapping is similar to, but different from, encoded further. (This mapping is similar to, but different from,
the encoding applied when including arbitrary content into some the encoding applied when including arbitrary content into some
part of a URI.) For example, an IRI of part of a URI.) For example, an IRI of
http://www.example.org/red%09ros&#xE9;#red (in XML notation) is http://www.example.org/red%09ros&#xE9;#red (in XML notation) is
converted to converted to
http://www.example.org/red%09ros%C3%A9#red, not to something like http://www.example.org/red%09ros%C3%A9#red, not to something like
http%3A%2F%2Fwww.example.org%2Fred%2509ros%C3%A9%23red. http%3A%2F%2Fwww.example.org%2Fred%2509ros%C3%A9%23red.
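The optional handling of printable US-ASCII characters that are not allowed in URIs might look like the sketch below. The names DISALLOWED_ASCII and iri_to_uri_lenient are hypothetical. Note that "#", "%", "[" and "]" are deliberately absent from the set, so existing percent-encodings and fragment delimiters pass through unchanged.

```python
# The ten printable US-ASCII characters named in the text above:
# < > " Space { } | \ ^ `
DISALLOWED_ASCII = set('<>" {}|\\^`')


def iri_to_uri_lenient(iri: str) -> str:
    """Map an IRI to a URI, also encoding disallowed printable ASCII.

    Assumption: Step 1 has been applied; no syntax validation is done.
    """
    out = []
    for ch in iri:
        if ord(ch) >= 0x80 or ch in DISALLOWED_ASCII:
            out.extend("%{:02X}".format(o) for o in ch.encode("utf-8"))
        else:
            # Includes "#", "%", "[" and "]", which MUST NOT be converted.
            out.append(ch)
    return "".join(out)


print(iri_to_uri_lenient("http://www.example.org/red%09ros\u00E9#red"))
# prints http://www.example.org/red%09ros%C3%A9#red
```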
skipping to change at page 14, line 5 skipping to change at page 14, line 44
conversion to a URI is: conversion to a URI is:
http://example.com/%F0%90%8C%80%F0%90%8C%81%F0%90%8C%82 http://example.com/%F0%90%8C%80%F0%90%8C%81%F0%90%8C%82
3.2 Converting URIs to IRIs 3.2 Converting URIs to IRIs
In some situations, it may be desirable to try to convert a URI into In some situations, it may be desirable to try to convert a URI into
an equivalent IRI. This section gives a procedure to do such a an equivalent IRI. This section gives a procedure to do such a
conversion. The conversion described in this section will always conversion. The conversion described in this section will always
result in an IRI which maps back to the URI that was used as an input result in an IRI which maps back to the URI that was used as an input
for the conversion (except for potential case differences in for the conversion (except for potential case differences in
percent-encoding). However, the IRI resulting from this conversion percent-encoding and for potential percent-encoded unreserved
may not be exactly the same as the original IRI (if there ever was characters). However, the IRI resulting from this conversion may not
one). be exactly the same as the original IRI (if there ever was one).
URI to IRI conversion removes percent-encodings, but not all URI to IRI conversion removes percent-encodings, but not all
percent-encodings can be eliminated. There are several reasons for percent-encodings can be eliminated. There are several reasons for
this: this:
a) Some percent-encodings are necessary to distinguish a) Some percent-encodings are necessary to distinguish
percent-encoded and unencoded uses of reserved characters. percent-encoded and unencoded uses of reserved characters.
b) Some percent-encodings cannot be interpreted as sequences of UTF-8 b) Some percent-encodings cannot be interpreted as sequences of UTF-8
octets. octets.
(Note: The octet patterns of UTF-8 are highly regular. Therefore, (Note: The octet patterns of UTF-8 are highly regular. Therefore,
there is a very high probability, but no guarantee, that there is a very high probability, but no guarantee, that
percent-encodings that can be interpreted as sequences of UTF-8 percent-encodings that can be interpreted as sequences of UTF-8
octets actually originated from UTF-8. For a detailed discussion, octets actually originated from UTF-8. For a detailed discussion,
see [Duerst97].) see [Duerst97].)
c) The conversion may result in a character that is not appropriate c) The conversion may result in a character that is not appropriate
in an IRI. See Section 6.1 for further details. in an IRI. See Section 2.2, Section 4.1, and Section 6.1 for
further details.
Conversion from a URI to an IRI is done using the following steps (or Conversion from a URI to an IRI is done using the following steps (or
any other algorithm that produces the same result): any other algorithm that produces the same result):
1) Represent the URI as a sequence of octets in US-ASCII. 1) Represent the URI as a sequence of octets in US-ASCII.
2) Convert all percent-encodings (% followed by two hexadecimal 2) Convert all percent-encodings (% followed by two hexadecimal
digits) except those corresponding to '%', characters in digits) except those corresponding to '%', characters in
'reserved', and characters in US-ASCII not allowed in URIs, to the 'reserved', and characters in US-ASCII not allowed in URIs, to the
corresponding octets. corresponding octets.
3) Re-percent-encode any octet produced in Step 2 that is not part of 3) Re-percent-encode any octet produced in Step 2 that is not part of
a strictly legal UTF-8 octet sequence. a strictly legal UTF-8 octet sequence.
4) Re-percent-encode all octets produced in Step 3 that in UTF-8 4) Re-percent-encode all octets produced in Step 3 that in UTF-8
represent characters that are not appropriate according to Section represent characters that are not appropriate according to Section
4.1 and Section 6.1. 2.2, Section 4.1, and Section 6.1.
5) Interpret the resulting octet sequence as a sequence of characters 5) Interpret the resulting octet sequence as a sequence of characters
encoded in UTF-8. encoded in UTF-8.
This procedure will convert as many percent-encoded non-ASCII This procedure will convert as many percent-encoded characters as
characters as possible to characters in an IRI. Because there are possible to characters in an IRI. Because there are some choices
some choices when applying Step 4 (see Section 6.1), results may when applying Step 4 (see Section 6.1), results may vary.
vary.
Conversions from URIs to IRIs MUST NOT use any other character Conversions from URIs to IRIs MUST NOT use any other character
encoding than UTF-8 in Steps 3 and 4 above, even if it might be encoding than UTF-8 in Steps 3 and 4 above, even if it might be
possible from context to guess that another character encoding than possible from context to guess that another character encoding than
UTF-8 was used in the URI. As an example, the URI http:// UTF-8 was used in the URI. As an example, the URI
www.example.org/r%E9sum%E9.html might with some guessing be http://www.example.org/r%E9sum%E9.html might with some guessing be
interpreted to contain two e-acute characters encoded as iso-8859-1. interpreted to contain two e-acute characters encoded as iso-8859-1.
It must not be converted to an IRI containing these e-acute It must not be converted to an IRI containing these e-acute
characters. Otherwise, the IRI will in the future be mapped to http:/ characters. Otherwise, the IRI will in the future be mapped to
/www.example.org/r%C3%A9sum%C3%A9.html, which is a different URI than http://www.example.org/r%C3%A9sum%C3%A9.html, which is a different
http://www.example.org/r%E9sum%E9.html. URI than http://www.example.org/r%E9sum%E9.html.
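The five steps above can be sketched in Python as follows. This is an illustration under simplifying assumptions, not a complete implementation: all names are hypothetical, percent-encoded US-ASCII octets are decoded only when they correspond to unreserved characters, and the "not appropriate" test of Step 4 is reduced to excluding code points below U+00A0 and the bidirectional formatting characters. Only UTF-8 is ever tried, in line with the requirement above.

```python
import re

UNRESERVED = set(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~"
)
# LRM, RLM, LRE, RLE, PDF, LRO, RLO (a simplified stand-in for Step 4).
BIDI_FORMATTING = set("\u200E\u200F\u202A\u202B\u202C\u202D\u202E")
PCT_RUN = re.compile(r"(?:%[0-9A-Fa-f]{2})+")


def _convert_run(match):
    text = match.group(0)                      # e.g. "%C3%BC"
    octets = bytes.fromhex(text.replace("%", ""))
    out, i = [], 0
    while i < len(octets):
        if octets[i] < 0x80:
            ch = chr(octets[i])
            # Keep '%', reserved and disallowed ASCII percent-encoded.
            out.append(ch if ch in UNRESERVED else text[3 * i:3 * i + 3])
            i += 1
            continue
        for n in (2, 3, 4):                    # candidate UTF-8 lengths
            try:
                ch = octets[i:i + n].decode("utf-8")   # strict decoding
            except UnicodeDecodeError:
                continue
            if len(ch) == 1 and ord(ch) >= 0xA0 and ch not in BIDI_FORMATTING:
                out.append(ch)
                i += n
                break
        else:
            # Steps 3 and 4: re-percent-encode the octet.
            out.append("%{:02X}".format(octets[i]))
            i += 1
    return "".join(out)


def uri_to_iri(uri: str) -> str:
    """Convert a URI (an ASCII string) to an IRI."""
    return PCT_RUN.sub(_convert_run, uri)


assert uri_to_iri("http://www.example.org/r%E9sum%E9.html") == \
    "http://www.example.org/r%E9sum%E9.html"
```

The final assertion reflects the rule above: the octet E9 is not a legal UTF-8 sequence on its own, so it stays percent-encoded instead of being guessed to be iso-8859-1.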
3.2.1 Examples 3.2.1 Examples
This section shows various examples of converting URIs to IRIs. The This section shows various examples of converting URIs to IRIs. Each
notation <hh> is used to denote octets outside those that can be example shows the result after applying each of the Steps 1 to 5.
represented in this document. Each example shows the result after XML Notation is used for the final result.
applying each of the Steps 1 to 5. XML Notation is used for the final
result.
The following example contains the sequence '%C3%BC', which is a The following example contains the sequence '%C3%BC', which is a
strictly legal UTF-8 sequence, and which is converted into the actual strictly legal UTF-8 sequence, and which is converted into the actual
character U+00FC LATIN SMALL LETTER U WITH DIAERESIS (also known as character U+00FC LATIN SMALL LETTER U WITH DIAERESIS (also known as
u-umlaut). u-umlaut).
1) http://www.example.org/D%C3%BCrst 1) http://www.example.org/D%C3%BCrst
2) http://www.example.org/D<c3><bc>rst 2) http://www.example.org/D<c3><bc>rst
skipping to change at page 16, line 32 skipping to change at page 17, line 20
2) http://xn--99zt52a.example.org/<e2><80><ae> 2) http://xn--99zt52a.example.org/<e2><80><ae>
3) http://xn--99zt52a.example.org/<e2><80><ae> 3) http://xn--99zt52a.example.org/<e2><80><ae>
4) http://xn--99zt52a.example.org/%E2%80%AE 4) http://xn--99zt52a.example.org/%E2%80%AE
5) http://xn--99zt52a.example.org/%E2%80%AE 5) http://xn--99zt52a.example.org/%E2%80%AE
Implementations with scheme-specific knowledge MAY convert Implementations with scheme-specific knowledge MAY convert
punycode-encoded domain name labels to the corresponding characters punycode-encoded domain name labels to the corresponding characters
using the ToUnicode procedure. Thus, for the example above, the label using the ToUnicode procedure. Thus, for the example above, the
xn--99zt52a may be converted to U+7D0D U+8C46 (Japanese Natto), label xn--99zt52a may be converted to U+7D0D U+8C46 (Japanese Natto),
leading to the overall IRI of leading to the overall IRI of
http://&#x7D0D;&#x8C46;.example.org/%E2%80%AE http://&#x7D0D;&#x8C46;.example.org/%E2%80%AE
4. Bidirectional IRIs for Right-to-left Languages 4. Bidirectional IRIs for Right-to-left Languages
Some UCS characters, such as those used in the Arabic and Hebrew Some UCS characters, such as those used in the Arabic and Hebrew
script, have an inherent right-to-left (rtl) writing direction. IRIs script, have an inherent right-to-left (rtl) writing direction. IRIs
containing such characters (called bidirectional IRIs or Bidi IRIs) containing such characters (called bidirectional IRIs or Bidi IRIs)
require additional attention because of the non-trivial relation require additional attention because of the non-trivial relation
between logical representation (used for digital representation as between logical representation (used for digital representation as
skipping to change at page 17, line 27 skipping to change at page 18, line 15
syntax rules (which includes the rules relevant to their scheme). syntax rules (which includes the rules relevant to their scheme).
This assures that bidirectional IRIs can be processed in the same way This assures that bidirectional IRIs can be processed in the same way
as other IRIs. as other IRIs.
When rendered, bidirectional IRIs MUST be rendered using the Unicode When rendered, bidirectional IRIs MUST be rendered using the Unicode
Bidirectional Algorithm [UNIV4], [UNI9]. Bidirectional IRIs MUST be Bidirectional Algorithm [UNIV4], [UNI9]. Bidirectional IRIs MUST be
rendered in the same way as they would be rendered if they were in a rendered in the same way as they would be rendered if they were in a
left-to-right embedding, i.e. as if they were preceded by U+202A, left-to-right embedding, i.e. as if they were preceded by U+202A,
LEFT-TO-RIGHT EMBEDDING (LRE), and followed by U+202C, POP LEFT-TO-RIGHT EMBEDDING (LRE), and followed by U+202C, POP
DIRECTIONAL FORMATTING (PDF). Setting the embedding direction can DIRECTIONAL FORMATTING (PDF). Setting the embedding direction can
also be done in a higher-order protocol (e.g. the dir='ltr' attribute also be done in a higher-level protocol (e.g. the dir='ltr'
in HTML). attribute in HTML).
There is no requirement to actually use the above embedding if the There is no requirement to actually use the above embedding if the
display is still the same without the embedding. For example, a display is still the same without the embedding. For example, a
bidirectional IRI in a text with left-to-right base directionality bidirectional IRI in a text with left-to-right base directionality
(such as used for English or Cyrillic) that is preceded and followed (such as used for English or Cyrillic) that is preceded and followed
by whitespace and strong left-to-right characters does not need an by whitespace and strong left-to-right characters does not need an
embedding. Also, a bidirectional relative IRI that only contains embedding. Also, a bidirectional relative IRI that only contains
strong right-to-left characters and weak characters and that starts strong right-to-left characters and weak characters and that starts
and ends with a strong right-to-left character and appears in a text and ends with a strong right-to-left character and appears in a text
with right-to-left base directionality (such as used for Arabic or with right-to-left base directionality (such as used for Arabic or
skipping to change at page 18, line 11 skipping to change at page 18, line 47
The Unicode Bidirectional Algorithm ([UNI9], Section 4.3) permits The Unicode Bidirectional Algorithm ([UNI9], Section 4.3) permits
higher-level protocols to influence bidirectional rendering. Such higher-level protocols to influence bidirectional rendering. Such
changes by higher-level protocols MUST NOT be used if they change the changes by higher-level protocols MUST NOT be used if they change the
rendering of IRIs. rendering of IRIs.
The bidirectional formatting characters that may be used before or The bidirectional formatting characters that may be used before or
after the IRI to assure correct display are themselves not part of after the IRI to assure correct display are themselves not part of
the IRI. IRIs MUST NOT contain bidirectional formatting characters the IRI. IRIs MUST NOT contain bidirectional formatting characters
(LRM, RLM, LRE, RLE, LRO, RLO, and PDF). They affect the visual (LRM, RLM, LRE, RLE, LRO, RLO, and PDF). They affect the visual
rendering of the IRI, but do not themselves appear visually. It would rendering of the IRI, but do not themselves appear visually. It
therefore not be possible to correctly input an IRI with such would therefore not be possible to correctly input an IRI with such
characters. characters.
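A display layer could apply the embedding and the prohibition described above roughly as in the following sketch. The names embed_for_display and contains_bidi_formatting are hypothetical; the embedding characters are added only to the text handed to the renderer and never become part of the IRI itself.

```python
LRE = "\u202A"  # LEFT-TO-RIGHT EMBEDDING
PDF = "\u202C"  # POP DIRECTIONAL FORMATTING

# LRM, RLM, LRE, RLE, PDF, LRO, RLO
BIDI_FORMATTING = set("\u200E\u200F\u202A\u202B\u202C\u202D\u202E")


def contains_bidi_formatting(iri: str) -> bool:
    """Return True if the IRI contains a bidi formatting character."""
    return any(ch in BIDI_FORMATTING for ch in iri)


def embed_for_display(iri: str) -> str:
    """Wrap an IRI in LRE ... PDF for rendering purposes only."""
    if contains_bidi_formatting(iri):
        raise ValueError("IRIs MUST NOT contain bidi formatting characters")
    return LRE + iri + PDF
```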
4.2 Bidi IRI Structure 4.2 Bidi IRI Structure
The Unicode Bidirectional Algorithm is designed mainly for running The Unicode Bidirectional Algorithm is designed mainly for running
text. To make sure that it does not affect the rendering of text. To make sure that it does not affect the rendering of
bidirectional IRIs too much, some restrictions on bidirectional IRIs bidirectional IRIs too much, some restrictions on bidirectional IRIs
are necessary. These restrictions are given in terms of delimiters are necessary. These restrictions are given in terms of delimiters
(structural characters, mostly punctuation such as '@', '.', ':', (structural characters, mostly punctuation such as '@', '.', ':',
'/') and components (usually consisting mostly of letters and '/') and components (usually consisting mostly of letters and
skipping to change at page 18, line 34 skipping to change at page 19, line 24
The following syntax rules from Section 2.2 correspond to components The following syntax rules from Section 2.2 correspond to components
for the purpose of Bidi behavior: iuserinfo, ireg-name, isegment, for the purpose of Bidi behavior: iuserinfo, ireg-name, isegment,
isegment-nz, isegment-nzc, iquery, and ifragment. isegment-nz, isegment-nzc, iquery, and ifragment.
Specifications that define the syntax of any of the above components Specifications that define the syntax of any of the above components
MAY divide them further and define smaller parts to be components MAY divide them further and define smaller parts to be components
according to this document. As an example, the restrictions of according to this document. As an example, the restrictions of
[RFC3490] on bidirectional domain names correspond to treating each [RFC3490] on bidirectional domain names correspond to treating each
label of a domain name as a component for those schemes where label of a domain name as a component for those schemes where
ireg-name is a domain name. Even where the components are not defined ireg-name is a domain name. Even where the components are not
formally, it may be helpful to think about some syntax in terms of defined formally, it may be helpful to think about some syntax in
components and to apply the relevant restrictions. For example, for terms of components and to apply the relevant restrictions. For
the usual name/value syntax in query parts, it is convenient to treat example, for the usual name/value syntax in query parts, it is
each name and each value as a component. As another example, the convenient to treat each name and each value as a component. As
extensions in a resource name can be treated as separate components. another example, the extensions in a resource name can be treated as
separate components.
For each component, the following restrictions apply: For each component, the following restrictions apply:
1) A component SHOULD NOT use both right-to-left and left-to-right 1) A component SHOULD NOT use both right-to-left and left-to-right
characters. characters.
2) A component using right-to-left characters SHOULD start and end 2) A component using right-to-left characters SHOULD start and end
with right-to-left characters. with right-to-left characters.
The above restrictions are given as shoulds, rather than as musts. The above restrictions are given as shoulds, rather than as musts.
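The two component restrictions can be checked with the bidirectional classes reported by the Python standard library. In this sketch the function name check_component is hypothetical, and "right-to-left character" is assumed to mean bidirectional class R or AL, with class L taken as left-to-right; digits and punctuation count as neither.

```python
import unicodedata

RTL_CLASSES = {"R", "AL"}


def check_component(component: str) -> list:
    """Return a list of violated restrictions (empty if none)."""
    classes = [unicodedata.bidirectional(ch) for ch in component]
    has_rtl = any(c in RTL_CLASSES for c in classes)
    has_ltr = any(c == "L" for c in classes)
    problems = []
    if has_rtl and has_ltr:
        problems.append("mixes right-to-left and left-to-right characters")
    if has_rtl and not (
        classes[0] in RTL_CLASSES and classes[-1] in RTL_CLASSES
    ):
        problems.append("does not start and end with right-to-left characters")
    return problems


# A Hebrew-only component passes; one ending in a digit does not.
assert check_component("\u05D0\u05D1") == []
assert len(check_component("\u05D01")) == 1
```

Since the restrictions are SHOULDs, a caller would more plausibly warn about the returned problems than reject the IRI.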
skipping to change at page 20, line 14 skipping to change at page 21, line 4
inverted as a whole: inverted as a whole:
logical representation: http://ab.CDE.FGH/ij/kl/mn/op.html logical representation: http://ab.CDE.FGH/ij/kl/mn/op.html
visual representation: http://ab.HGF.EDC/ij/kl/mn/op.html visual representation: http://ab.HGF.EDC/ij/kl/mn/op.html
A sequence of rtl components is read rtl, in the same way as a A sequence of rtl components is read rtl, in the same way as a
sequence of rtl words is read rtl in a bidi text. sequence of rtl words is read rtl in a bidi text.
Example 3: All components of an IRI (except for the scheme) are rtl. Example 3: All components of an IRI (except for the scheme) are rtl.
All rtl components are inverted overall: All rtl components are inverted overall:
logical representation: http://AB.CD.EF/GH/IJ/KL?MN=OP;QR=ST#UV logical representation: http://AB.CD.EF/GH/IJ/KL?MN=OP;QR=ST#UV
visual representation: http://VU#TS=RQ;PO=NM?LK/JI/HG/FE.DC.BA visual representation: http://VU#TS=RQ;PO=NM?LK/JI/HG/FE.DC.BA
The whole IRI (except the scheme) is read rtl. Delimiters between rtl The whole IRI (except the scheme) is read rtl. Delimiters between
components stay between the respective components; delimiters between rtl components stay between the respective components; delimiters
ltr and rtl components don't move. between ltr and rtl components don't move.
Example 4: Several sequences of rtl components are each inverted on Example 4: Several sequences of rtl components are each inverted on
their own: their own:
logical representation: http://AB.CD.ef/gh/IJ/KL.html logical representation: http://AB.CD.ef/gh/IJ/KL.html
visual representation: http://DC.BA.ef/gh/LK/JI.html visual representation: http://DC.BA.ef/gh/LK/JI.html
Each sequence of rtl components is read rtl, in the same way as each Each sequence of rtl components is read rtl, in the same way as each
sequence of rtl words in an ltr text is read rtl. sequence of rtl words in an ltr text is read rtl.
Example 5: Example 2, applied to components of different kinds: Example 5: Example 2, applied to components of different kinds:
logical representation: http://ab.cd.EF/GH/ij/kl.html logical representation: http://ab.cd.EF/GH/ij/kl.html
skipping to change at page 21, line 27 skipping to change at page 22, line 17
Example 10 (allowed, but not recommended): Example 10 (allowed, but not recommended):
logical representation: http://ab.CDEFGH.123/kl/mn/op.html logical representation: http://ab.CDEFGH.123/kl/mn/op.html
visual representation: http://ab.123.HGFEDC/kl/mn/op.html visual representation: http://ab.123.HGFEDC/kl/mn/op.html
Components consisting of only numbers are allowed (it would be rather Components consisting of only numbers are allowed (it would be rather
difficult to prohibit them), but may interact with adjacent RTL difficult to prohibit them), but may interact with adjacent RTL
components in ways that are not easy to predict. components in ways that are not easy to predict.
5. IRI Equivalence and Comparison 5. IRI Equivalence and Comparison
This section discusses IRI Equivalence and Comparison similar to This section discusses IRI Equivalence and Comparison similar to
Section 6, "Normalization and Comparison", in [RFCYYYY]. This section Section 6, "Normalization and Comparison", in [RFCYYYY]. This
focuses on the main issues and on aspects that are different from section focuses on the main issues and on aspects that are different
[RFCYYYY]; Section 6 of [RFCYYYY] is recommended background reading. from [RFCYYYY]; Section 6 of [RFCYYYY] is recommended background
reading.
There is no general rule or procedure to decide whether two arbitrary There is no general rule or procedure to decide whether two arbitrary
IRIs are equivalent or not (i.e. whether they refer to the same IRIs are equivalent or not (i.e. whether they refer to the same
resource or not). Two IRIs that look almost the same may refer to resource or not). Two IRIs that look almost the same may refer to
different resources. Two IRIs that look completely different may different resources. Two IRIs that look completely different may
refer to the same resource. Each specification or application that refer to the same resource. Each specification or application that
uses IRIs has to decide on the appropriate criterion for IRI uses IRIs has to decide on the appropriate criterion for IRI
equivalence. equivalence.
5.1 Simple String Comparison 5.1 Simple String Comparison
skipping to change at page 21, line 51 skipping to change at page 22, line 42
In some scenarios a definite answer to the question of IRI In some scenarios a definite answer to the question of IRI
equivalence is needed that is independent of the scheme used and equivalence is needed that is independent of the scheme used and
always can be calculated quickly and without accessing a network. An always can be calculated quickly and without accessing a network. An
example of such a case is XML Namespaces ([XMLNamespace]). In such example of such a case is XML Namespaces ([XMLNamespace]). In such
cases, two IRIs SHOULD be defined as equivalent if and only if they cases, two IRIs SHOULD be defined as equivalent if and only if they
are character-by-character equivalent. This is the same as being are character-by-character equivalent. This is the same as being
byte-by-byte equivalent if the character encoding for both IRIs is byte-by-byte equivalent if the character encoding for both IRIs is
the same. As an example, the same. As an example,
http://example.org/~user, http://example.org/%7euser, and http://example.org/~user, http://example.org/%7euser, and
http://example.org/%7Euser are not equivalent under this definition. http://example.org/%7Euser are not equivalent under this definition.
In such a case, the comparison function MUST NOT map IRIs to URIs, When comparing character-by-character, the comparison function MUST
because such a mapping would create additional spurious equivalences. NOT map IRIs to URIs, because such a mapping would create additional
spurious equivalences.
It follows that IRIs SHOULD NOT be modified when being transported if It follows that IRIs SHOULD NOT be modified when being transported if
there is any chance that this IRI might be used as an identifier in there is any chance that this IRI might be used as an identifier in
the way explained above. the way explained above. When an IRI is used as an identifier in
scenarios that depend upon character-by-character equivalence,
creators of IRIs should take additional care to avoid IRIs that only
differ in their use of percent-escaping. As an example, using both
http://example.org/~user and http://example.org/%7Euser to identify
XML Namespaces is a bad idea.
5.2 Conversion to URIs 5.2 Conversion to URIs
For actual resolution, differences in percent-encoding (except for For actual resolution, differences in percent-encoding (except for
the percent-encoding of reserved characters) MUST always result in the percent-encoding of reserved characters) MUST always result in
the same resource. For example, http://example.org/~user, the same resource. For example, http://example.org/~user,
http://example.org/%7euser and http://example.org/%7Euser must http://example.org/%7euser and http://example.org/%7Euser must
resolve to the same resource. resolve to the same resource.
If this kind of equivalence is to be tested, the percent-encoding of If this kind of equivalence is to be tested, the percent-encoding of
both IRIs to be compared has to be aligned, for example by converting both IRIs to be compared has to be aligned, for example by converting
both IRIs to URIs (see Section 3.1) and making sure that the case of both IRIs to URIs (see Section 3.1), eliminating escape differences
the hexadecimal characters in the percent-encode is always the same in the resulting URIs, and making sure that the case of the
(preferably upper case). For comparison, such conversions MUST only hexadecimal characters in the percent-encoding is always the same
be done on the fly, while retaining the original IRI. (preferably upper case). If the IRI is to be passed to another
application, or used further in some other way, its original form
MUST be preserved; the conversion described here should be performed
only for the purpose of local comparison.
Additional, similar equivalences are possible based on knowledge Additional, similar equivalences are possible based on knowledge
about the generic URI/IRI syntax, such as the fact that the scheme about the generic URI/IRI syntax, such as the fact that the scheme
part is case-insensitive. part is case-insensitive.
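The alignment described in this section might be sketched as follows in Python. This is an illustration under stated assumptions, not a definitive implementation: the helper name comparison_key is hypothetical, the input is assumed to be an absolute IRI, the unreserved set is assumed to be ALPHA / DIGIT / "-" / "." / "_" / "~", and the special handling of the ireg-name part from Section 3.1 is omitted.

```python
import re
from urllib.parse import quote

# Assumed unreserved set; escapes of these characters carry no meaning.
UNRESERVED = set(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789-._~"
)
_PCT = re.compile(r"%([0-9A-Fa-f]{2})")


def _align_escape(match: "re.Match[str]") -> str:
    """Unescape unreserved characters; upper-case all other escapes."""
    char = chr(int(match.group(1), 16))
    if char in UNRESERVED:
        return char
    return "%" + match.group(1).upper()


def comparison_key(iri: str) -> str:
    """Return a key for local comparison only; the caller keeps the
    original IRI for any further use."""
    # Percent-encode each non-ASCII character as UTF-8 (simplified mapping).
    uri = "".join(c if ord(c) < 128 else quote(c, safe="") for c in iri)
    # Eliminate escape differences and align hexadecimal case.
    uri = _PCT.sub(_align_escape, uri)
    # The scheme part is case-insensitive.
    scheme, sep, rest = uri.partition(":")
    return scheme.lower() + sep + rest if sep else uri
```

Escapes of reserved characters such as %2F are deliberately left escaped, since unescaping them could change the meaning of the identifier.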
5.3 Normalization 5.3 Normalization
The Unicode Standard [UNIV4] defines various equivalences between The Unicode Standard [UNIV4] defines various equivalences between
sequences of characters for various purposes. Unicode Standard Annex sequences of characters for various purposes. Unicode Standard Annex
#15 [UTR15] defines various Normalization Forms for these #15 [UTR15] defines various Normalization Forms for these
skipping to change at page 22, line 51 skipping to change at page 23, line 51
comparing two IRIs. The exceptions are conversion from a non-digital comparing two IRIs. The exceptions are conversion from a non-digital
form, and conversion from a non-UCS-based character encoding to a form, and conversion from a non-UCS-based character encoding to a
UCS-based character encoding. In these cases, NFC or a normalizing UCS-based character encoding. In these cases, NFC or a normalizing
transcoder using NFC MUST be used for interoperability. To avoid transcoder using NFC MUST be used for interoperability. To avoid
false negatives and problems with transcoding, IRIs SHOULD be created false negatives and problems with transcoding, IRIs SHOULD be created
using NFC. Using NFKC may avoid even more problems, for example by using NFC. Using NFKC may avoid even more problems, for example by
choosing half-width Latin letters instead of full-width, and choosing half-width Latin letters instead of full-width, and
full-width Katakana instead of half-width. full-width Katakana instead of half-width.
As an example, http://www.example.org/r&#xE9;sum&#xE9;.html (in XML As an example, http://www.example.org/r&#xE9;sum&#xE9;.html (in XML
Notation) is in NFC. On the other hand, http://www.example.org/ Notation) is in NFC. On the other hand,
re&#x301;sume&#x301;.html is not in NFC. The former uses precombined http://www.example.org/re&#x301;sume&#x301;.html is not in NFC. The
e-acute characters, the latter uses 'e' characters followed by former uses precombined e-acute characters, the latter uses 'e'
combining acute accents. Both usages are defined to be canonically characters followed by combining acute accents. Both usages are
equivalent in [UNIV4]. defined to be canonically equivalent in [UNIV4].
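The two example IRIs above can be checked with Python's standard unicodedata module. This is a small sketch; the helper name is_nfc is hypothetical.

```python
import unicodedata

# Precomposed e-acute (U+00E9) versus 'e' followed by
# combining acute accent (U+0301).
precomposed = "http://www.example.org/r\u00e9sum\u00e9.html"
decomposed = "http://www.example.org/re\u0301sume\u0301.html"


def is_nfc(text: str) -> bool:
    """True if the text is unchanged by normalization to NFC."""
    return unicodedata.normalize("NFC", text) == text


# The two strings differ character by character, yet normalizing the
# decomposed form to NFC yields exactly the precomposed form.
same_after_nfc = unicodedata.normalize("NFC", decomposed) == precomposed
```

The check is useful when creating an IRI. It is not a licence for intermediaries to normalize IRIs they merely transport.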
Note: Because it is unknown how a particular field is being treated Note: Because it is unknown how a particular field is being treated
with respect to text normalization, it would be inappropriate to with respect to text normalization, it would be inappropriate to
allow third parties to normalize an IRI arbitrarily. This does not allow third parties to normalize an IRI arbitrarily. This does
contradict the recommendation that when a resource is created, its not contradict the recommendation that when a resource is created,
IRI should be as normalized as possible (i.e. NFC or even NFKC). its IRI should be as normalized as possible (i.e. NFC or even
This is similar to the upper-case/lower-case problems in URIs. NFKC). This is similar to the upper-case/lower-case problems in
Some parts of a URI are case-insensitive (domain name). For URIs. Some parts of a URI are case-insensitive (domain name).
others, it is unclear whether they are case-sensitive or For others, it is unclear whether they are case-sensitive or
case-insensitive, or something in between (e.g. case-sensitive, case-insensitive, or something in between (e.g. case-sensitive,
but if the wrong case is used, a multiple choice selection is but if the wrong case is used, a multiple choice selection is
provided instead of a direct negative result). The best recipe is provided instead of a direct negative result). The best recipe is
that the creator uses a reasonable capitalization, and when that the creator uses a reasonable capitalization, and when
transferring the URI, that capitalization is never changed. transferring the URI, that capitalization is never changed.
Various IRI schemes may allow the usage of International Domain Names Various IRI schemes may allow the usage of International Domain Names
(IDN) [RFC3490]. When in use in IRIs, those names SHOULD be validated (IDN) [RFC3490]. When in use in IRIs, those names SHOULD be
using the ToASCII operation defined in [RFC3490], with the flags validated using the ToASCII operation defined in [RFC3490], with the
"UseSTD3ASCIIRules" and "AllowUnassigned". An IRI containing an flags "UseSTD3ASCIIRules" and "AllowUnassigned". An IRI containing
invalid IDN cannot successfully be resolved. For legibility purposes, an invalid IDN cannot successfully be resolved. For legibility
IDN components of IRIs SHOULD NOT be converted into ASCII Compatible purposes, IDN components of IRIs SHOULD NOT be converted into ASCII
Encoding (ACE). Compatible Encoding (ACE).
5.4 Preferred Forms 5.4 Preferred Forms
The following are the preferred forms for IRIs when created: The following are the preferred forms for IRIs when created:
- Always provide the URI scheme in lowercase characters. - Always provide the URI scheme in lowercase characters.
- Only perform percent-encoding where it is essential. - Only perform percent-encoding where it is essential.
- Always use uppercase A-through-F characters when percent-encoding. - Always use uppercase A-through-F characters when percent-encoding.
skipping to change at page 24, line 13 skipping to change at page 25, line 13
- Prevent /./ and /../ from appearing in non-relative URI paths. - Prevent /./ and /../ from appearing in non-relative URI paths.
- For schemes that define an empty path to be equivalent to a path - For schemes that define an empty path to be equivalent to a path
of "/", use "/". of "/", use "/".
6. Use of IRIs 6. Use of IRIs
6.1 Limitations on UCS Characters Allowed in IRIs 6.1 Limitations on UCS Characters Allowed in IRIs
This section discusses limitations on characters and character This section discusses limitations on characters and character
sequences usable for IRIs. The considerations in this section are sequences usable for IRIs beyond those given in Section 2.2 and
relevant when creating IRIs and when converting from URIs to IRIs. Section 4.1. The considerations in this section are relevant when
creating IRIs and when converting from URIs to IRIs.
a) The repertoire of characters allowed in each IRI component is a) The repertoire of characters allowed in each IRI component is
limited by the definition of that component. For example, the limited by the definition of that component. For example, the
definition of the scheme component does not allow characters definition of the scheme component does not allow characters
beyond US-ASCII. beyond US-ASCII.
(Note: In accordance with URI practice, generic IRI software (Note: In accordance with URI practice, generic IRI software
cannot and should not check for such limitations.) cannot and should not check for such limitations.)
b) The UCS contains many areas of characters for which there are b) The UCS contains many areas of characters for which there are
strong visual look-alikes. Because of the likelihood of strong visual look-alikes. Because of the likelihood of
transcription errors, these also should be avoided. This includes transcription errors, these also should be avoided. This includes
the full-width equivalents of ASCII characters, half-width the full-width equivalents of Latin characters, half-width
Katakana characters for Japanese, and many others. This also Katakana characters for Japanese, and many others. This also
includes many look-alikes of "space", "delims", and "unwise", includes many look-alikes of "space", "delims", and "unwise",
characters excluded in [RFC3491]. characters excluded in [RFC3491].
Additional information is available from [UNIXML]. [UNIXML] is Additional information is available from [UNIXML]. [UNIXML] is
written in the context of running text rather than in the context of written in the context of running text rather than in the context of
identifiers. Nevertheless, it discusses many of the categories of identifiers. Nevertheless, it discusses many of the categories of
characters not appropriate for IRIs. characters not appropriate for IRIs.
6.2 Software Interfaces and Protocols 6.2 Software Interfaces and Protocols
skipping to change at page 25, line 35 skipping to change at page 26, line 35
formats and protocols will be required to handle IRIs [CharMod]. formats and protocols will be required to handle IRIs [CharMod].
6.4 Use of UTF-8 for Encoding Original Characters 6.4 Use of UTF-8 for Encoding Original Characters
This section discusses details and gives examples for point c) in This section discusses details and gives examples for point c) in
Section 1.2. In order to be able to use IRIs, the URI corresponding Section 1.2. In order to be able to use IRIs, the URI corresponding
to the IRI in question has to encode original characters into octets to the IRI in question has to encode original characters into octets
using UTF-8. This can be specified for all URIs of a URI scheme, or using UTF-8. This can be specified for all URIs of a URI scheme, or
can apply to individual URIs for schemes that do not specify how to can apply to individual URIs for schemes that do not specify how to
encode original characters. It can apply to the whole URI, or only encode original characters. It can apply to the whole URI, or only
some part. some part. For background information on encoding characters into
URIs, see also Section 2.5 of [RFCYYYY].
For new URI schemes, using UTF-8 is recommended in [RFC2718]. For new URI schemes, using UTF-8 is recommended in [RFC2718].
Examples where this is already used are the URN syntax [RFC2141], Examples where this is already used are the URN syntax [RFC2141],
IMAP URLs [RFC2192], and POP URLs [RFC2384]. On the other hand, IMAP URLs [RFC2192], and POP URLs [RFC2384]. On the other hand,
because the HTTP URL scheme does not specify how to encode original because the HTTP URL scheme does not specify how to encode original
characters, only some HTTP URLs can have corresponding but different characters, only some HTTP URLs can have corresponding but different
IRIs. IRIs.
For example, for a document with a URI of For example, for a document with a URI of
http://www.example.org/r%C3%A9sum%C3%A9.html, it is possible to http://www.example.org/r%C3%A9sum%C3%A9.html, it is possible to
construct a corresponding IRI (in XML notation, see Section 1.4): construct a corresponding IRI (in XML notation, see Section 1.4):
http://www.example.org/r&#xE9;sum&#xE9;.html (&#xE9; stands for the http://www.example.org/r&#xE9;sum&#xE9;.html (&#xE9; stands for the
e-acute character, and %C3%A9 is the UTF-8 encoded and e-acute character, and %C3%A9 is the UTF-8 encoded and
percent-encoded representation of that character). On the other hand, percent-encoded representation of that character). On the other
for a document with a URI of http://www.example.org/r%E9sum%E9.html, hand, for a document with a URI of
the percent-encoding octets cannot be converted to actual characters http://www.example.org/r%E9sum%E9.html, the percent-encoding octets
in an IRI, because the percent-encoding is not based on UTF-8. cannot be converted to actual characters in an IRI, because the
percent-encoding is not based on UTF-8.
The requirement for the use of UTF-8 applies to all parts of a URI The requirement for the use of UTF-8 applies to all parts of a URI
(with the potential exception of the ireg-name part, see Section (with the potential exception of the ireg-name part, see Section
3.1). However, it is possible that the capability of IRIs to 3.1). However, it is possible that the capability of IRIs to
represent a wide range of characters directly is used just in some represent a wide range of characters directly is used just in some
parts of the IRI (or IRI reference). The other parts of the IRI may parts of the IRI (or IRI reference). The other parts of the IRI may
only contain ASCII characters, or they may not be based on UTF-8. only contain US-ASCII characters, or they may not be based on UTF-8.
They may be based on another character encoding, or they may directly They may be based on another character encoding, or they may directly
encode raw binary data (see also [RFC2397]). encode raw binary data (see also [RFC2397]).
For example, it is possible to have a URI reference of For example, it is possible to have a URI reference of
http://www.example.org/r%E9sum%E9.xml#r%C3%A9sum%C3%A9, where the http://www.example.org/r%E9sum%E9.xml#r%C3%A9sum%C3%A9, where the
document name is encoded in iso-8859-1 based on server settings, but document name is encoded in iso-8859-1 based on server settings, but
the fragment identifier is encoded in UTF-8 according to [XPointer]. the fragment identifier is encoded in UTF-8 according to [XPointer].
The IRI corresponding to the above URI would be (in XML notation) The IRI corresponding to the above URI would be (in XML notation)
http://www.example.org/r%E9sum%E9.xml#r&#xE9;sum&#xE9;. http://www.example.org/r%E9sum%E9.xml#r&#xE9;sum&#xE9;.
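The examples in this section can be reproduced with a deliberately simplified sketch of the URI-to-IRI direction. The function name uri_to_iri_sketch is hypothetical. Only runs of percent-encoded octets at or above 0x80 are considered, a run is converted only if it is valid UTF-8 as a whole, and the further restrictions of Section 3.2 on which characters may be converted are not implemented.

```python
import re
from urllib.parse import unquote_to_bytes

# Runs of percent-encoded octets in the range 0x80-0xFF.
# Escapes of US-ASCII octets are never touched.
_HIGH_RUN = re.compile(r"(?:%[89A-Fa-f][0-9A-Fa-f])+")


def _convert_run(match: "re.Match[str]") -> str:
    run = match.group(0)
    try:
        # Strict decoding rejects octets that are not valid UTF-8.
        return unquote_to_bytes(run).decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        # Not based on UTF-8: keep the percent-encoding as it is.
        return run


def uri_to_iri_sketch(uri: str) -> str:
    """Convert UTF-8-based escapes to characters; leave others alone."""
    return _HIGH_RUN.sub(_convert_run, uri)
```

Applied to the mixed URI reference above, the iso-8859-1 based %E9 escapes in the path stay percent-encoded, while the UTF-8 based escapes in the fragment identifier become e-acute characters.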
skipping to change at page 27, line 10 skipping to change at page 28, line 14
In case the current handling in an API or protocol is based on In case the current handling in an API or protocol is based on
US-ASCII, UTF-8 is recommended as the character encoding for IRIs, US-ASCII, UTF-8 is recommended as the character encoding for IRIs,
because this is compatible with US-ASCII, is in accordance with the because this is compatible with US-ASCII, is in accordance with the
recommendations of [RFC2277], and makes it easy to convert to URIs recommendations of [RFC2277], and makes it easy to convert to URIs
where necessary. In any case, the API or protocol definition must where necessary. In any case, the API or protocol definition must
clearly define the character encoding to be used. clearly define the character encoding to be used.
The transfer from URI-only to IRI-capable components requires no The transfer from URI-only to IRI-capable components requires no
mapping, although the conversion described in Section 3.2 above may mapping, although the conversion described in Section 3.2 above may
be performed. It is preferable not to perform this inverse conversion be performed. It is preferable not to perform this inverse
when there is a chance that this cannot be done correctly. conversion when there is a chance that this cannot be done correctly.
7.2 URI/IRI Entry 7.2 URI/IRI Entry
There are components that allow users to enter URIs into the system, There are components that allow users to enter URIs into the system,
for example by typing or dictation. This software must be updated to for example by typing or dictation. This software must be updated to
allow for IRI entry. allow for IRI entry.
A person viewing a visual representation of an IRI (as a sequence of A person viewing a visual representation of an IRI (as a sequence of
glyphs, in some order, in some visual display) or hearing an IRI, glyphs, in some order, in some visual display) or hearing an IRI,
will use an entry method for characters in the user's language to will use an entry method for characters in the user's language to
skipping to change at page 27, line 36 skipping to change at page 28, line 40
restrictions defined in Section 2.2 are met. This may be done by restrictions defined in Section 2.2 are met. This may be done by
choosing appropriate input methods or variants/settings thereof, by choosing appropriate input methods or variants/settings thereof, by
appropriately converting the characters being input, by eliminating appropriately converting the characters being input, by eliminating
characters that cannot be converted, and/or by issuing a warning or characters that cannot be converted, and/or by issuing a warning or
error message to the user. error message to the user.
As an example of variant settings, input method editors for East As an example of variant settings, input method editors for East
Asian Languages usually allow the input of Latin letters and related Asian Languages usually allow the input of Latin letters and related
characters in full-width or half-width versions. For IRI input, the characters in full-width or half-width versions. For IRI input, the
input method editor should be set so that it produces half-width input method editor should be set so that it produces half-width
Latin letters, and full-width Katakana. Latin letters and punctuation, and full-width Katakana.
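Where the input method cannot be configured, an entry component might convert the characters after input. The following Python sketch assumes that applying NFKC at entry time is acceptable for the application; the function name fold_width is hypothetical. Note that NFKC folds more than width variants (ligatures and other compatibility characters, for example), so this is broader than a pure width conversion.

```python
import unicodedata


def fold_width(text: str) -> str:
    """Map full-width Latin letters and punctuation to half-width,
    and half-width Katakana to full-width, by applying NFKC."""
    return unicodedata.normalize("NFKC", text)


# Full-width "example" followed by a full-width full stop and solidus.
fullwidth_latin = "\uff45\uff58\uff41\uff4d\uff50\uff4c\uff45\uff0e\uff0f"
# Half-width Katakana "ka ta ka na".
halfwidth_katakana = "\uff76\uff80\uff76\uff85"

folded_latin = fold_width(fullwidth_latin)
folded_katakana = fold_width(halfwidth_katakana)
```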
An input field primarily or only used for the input of URIs/IRIs may An input field primarily or only used for the input of URIs/IRIs may
allow the user to view an IRI as mapped to a URI. Places where the allow the user to view an IRI as mapped to a URI. Places where the
input of IRIs is frequent may provide the possibility for viewing an input of IRIs is frequent may provide the possibility for viewing an
IRI as mapped to a URI. This will help users when some of the IRI as mapped to a URI. This will help users when some of the
software they use does not yet accept IRIs. software they use does not yet accept IRIs.
An IRI input component that interfaces to components that handle An IRI input component that interfaces to components that handle
URIs, but not IRIs, must map the IRI to a URI before passing it to URIs, but not IRIs, must map the IRI to a URI before passing it to
such a component. such a component.
skipping to change at page 29, line 13 skipping to change at page 30, line 13
servers, similar considerations apply, see in particular [RFC2640]. servers, similar considerations apply, see in particular [RFC2640].
7.5 URI/IRI Selection 7.5 URI/IRI Selection
In some cases, resource owners and publishers have control over the In some cases, resource owners and publishers have control over the
IRIs used to identify their resources. Such control is mostly IRIs used to identify their resources. Such control is mostly
executed by controlling the resource names, such as file names, executed by controlling the resource names, such as file names,
directly. directly.
In such cases, it is recommended to avoid choosing IRIs that are In such cases, it is recommended to avoid choosing IRIs that are
easily confused. For example, for US-ASCII, the lower-case ell "l" is easily confused. For example, for US-ASCII, the lower-case ell "l"
easily confused with the digit one "1", and the upper-case oh "O" is is easily confused with the digit one "1", and the upper-case oh "O"
easily confused with the digit zero "0". Publishers should avoid is easily confused with the digit zero "0". Publishers should avoid
confusing users with "br0ken" or "1ame" identifiers. confusing users with "br0ken" or "1ame" identifiers.
Outside of the US-ASCII range, there are many more opportunities for Outside of the US-ASCII repertoire, there are many more opportunities
confusion; a complete set of guidelines is too lengthy to include for confusion; a complete set of guidelines is too lengthy to include
here. As long as names are limited to characters from a single here. As long as names are limited to characters from a single
script, native writers of a given script or language will know best script, native writers of a given script or language will know best
when ambiguities can appear, and how they can be avoided. What may when ambiguities can appear, and how they can be avoided. What may
look ambiguous to a stranger may be completely obvious to the average look ambiguous to a stranger may be completely obvious to the average
native user. On the other hand, in some cases, the UCS contains native user. On the other hand, in some cases, the UCS contains
variants for compatibility reasons, for example for typographic variants for compatibility reasons, for example for typographic
purposes. These should be avoided wherever possible. Although there purposes. These should be avoided wherever possible. Although there
may be exceptions, in general newly created resource names should be may be exceptions, in general newly created resource names should be
in NFKC [UTR15] (which means that they are also in NFC). in NFKC [UTR15] (which means that they are also in NFC).
skipping to change at page 30, line 28 skipping to change at page 31, line 28
encodings than UTF-8. Such URIs may be produced by user agents that encodings than UTF-8. Such URIs may be produced by user agents that
do not conform to this specification and use legacy character do not conform to this specification and use legacy character
encodings to convert non-ASCII characters to URIs. Whether this is encodings to convert non-ASCII characters to URIs. Whether this is
necessary and what character encodings to cover, depends on a number necessary and what character encodings to cover, depends on a number
of factors, such as the legacy character encodings used locally and of factors, such as the legacy character encodings used locally and
the distribution of various versions of user agents. For example, the distribution of various versions of user agents. For example,
software for Japanese may accept URIs in Shift_JIS and/or EUC-JP in software for Japanese may accept URIs in Shift_JIS and/or EUC-JP in
addition to UTF-8. addition to UTF-8.
Third, it may include additional mappings to be more user-friendly Third, it may include additional mappings to be more user-friendly
and robust against transmission errors. These would be similar to how and robust against transmission errors. These would be similar to
currently some servers treat URIs as case-insensitive, or perform how currently some servers treat URIs as case-insensitive, or perform
additional matching to account for spelling errors. For characters additional matching to account for spelling errors. For characters
beyond the ASCII repertoire, this may for example include ignoring beyond the US-ASCII repertoire, this may for example include ignoring
the accents on received IRIs or resource names where appropriate. the accents on received IRIs or resource names where appropriate.
Please note that such mappings, including case mappings, are Please note that such mappings, including case mappings, are
language-dependent. language-dependent.
It can be difficult to unambiguously identify a resource if too many It can be difficult to unambiguously identify a resource if too many
mappings are taken into consideration. However, percent-encoded and mappings are taken into consideration. However, percent-encoded and
not percent-encoded parts of IRIs can always clearly be not percent-encoded parts of IRIs can always clearly be
distinguished. Also, the regularity of UTF-8 (see [Duerst97]) makes distinguished. Also, the regularity of UTF-8 (see [Duerst97]) makes
the potential for collisions lower than it may seem at first sight. the potential for collisions lower than it may seem at first sight.
skipping to change at page 31, line 21 skipping to change at page 32, line 21
individual IRI, care should be taken to upgrade the corresponding individual IRI, care should be taken to upgrade the corresponding
interpreting software in order to cover the forms expected to be interpreting software in order to cover the forms expected to be
received by various versions of entry and transport software. received by various versions of entry and transport software.
The upgrade of generating software to generate IRIs instead of using The upgrade of generating software to generate IRIs instead of using
a local character encoding should happen only after the service is a local character encoding should happen only after the service is
upgraded to accept IRIs. Similarly, IRIs should only be generated upgraded to accept IRIs. Similarly, IRIs should only be generated
when the service accepts IRIs and the intervening infrastructure and when the service accepts IRIs and the intervening infrastructure and
protocol is known to transport them safely. protocol is known to transport them safely.
Display software should be upgraded only after upgraded entry Software converting from URIs to IRIs for display should be upgraded
software has been widely deployed to the population that will see the only after upgraded entry software has been widely deployed to the
displayed result. population that will see the displayed result.
It is often possible to reduce the effort and dependencies for It is often possible to reduce the effort and dependencies for
upgrading to IRIs by using UTF-8 rather than another character upgrading to IRIs by using UTF-8 rather than another character
encoding where there is a free choice of character encodings. For encoding where there is a free choice of character encodings. For
example, when setting up a new file-based Web server, using UTF-8 as example, when setting up a new file-based Web server, using UTF-8 as
the character encoding for file names will make the transition to the character encoding for file names will make the transition to
IRIs easier. Likewise, when setting up a new Web form using UTF-8 as IRIs easier. Likewise, when setting up a new Web form using UTF-8 as
the character encoding of the form page, the returned query URIs will the character encoding of the form page, the returned query URIs will
use UTF-8 as the character encoding (unless the user, for whatever use UTF-8 as the character encoding (unless the user, for whatever
reason, changes the character encoding) and will therefore be reason, changes the character encoding) and will therefore be
compatible with IRIs. compatible with IRIs.
These recommendations, when taken together, will allow for the These recommendations, when taken together, will allow for the
extension from URIs to IRIs in order to handle scripts other than extension from URIs to IRIs in order to handle characters other than
ASCII while minimizing interoperability problems. US-ASCII while minimizing interoperability problems.
8. Security Considerations 8. Security Considerations
The security considerations discussed in [RFCYYYY] also apply to The security considerations discussed in [RFCYYYY] also apply to
IRIs. In addition, the following issues require particular care for IRIs. In addition, the following issues require particular care for
IRIs. IRIs.
Incorrect encoding or decoding can lead to security problems. In Incorrect encoding or decoding can lead to security problems. In
particular, some UTF-8 decoders do not check against overlong byte particular, some UTF-8 decoders do not check against overlong byte
sequences. As an example, a '/' is encoded with the byte 0x2F both in sequences. As an example, a '/' is encoded with the byte 0x2F both
UTF-8 and in ASCII, but some UTF-8 decoders also wrongly interpret in UTF-8 and in US-ASCII, but some UTF-8 decoders also wrongly
the sequence 0xC0 0xAF as a '/'. A sequence such as '%C0%AF..' may interpret the sequence 0xC0 0xAF as a '/'. A sequence such as
pass some security tests and then be interpreted as '/..' in a path '%C0%AF..' may pass some security tests and then be interpreted as '/
if UTF-8 decoders are fault-tolerant, if conversion and checking are ..' in a path if UTF-8 decoders are fault-tolerant, if conversion and
not done in the right order, and/or if reserved characters and checking are not done in the right order, and/or if reserved
unreserved characters are not clearly distinguished. characters and unreserved characters are not clearly distinguished.
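The ordering problem described above can be illustrated with a short Python sketch. The names decode_path_strict and is_safe_path are hypothetical, and the policy shown (reject any ".." segment after full decoding) is only one possible, conservative check. Python's built-in UTF-8 decoder rejects the overlong sequence 0xC0 0xAF in strict mode.

```python
from urllib.parse import unquote_to_bytes


def decode_path_strict(escaped: str) -> str:
    """Percent-decode to octets, then decode as UTF-8 without any
    fault tolerance; raises UnicodeDecodeError on invalid input."""
    return unquote_to_bytes(escaped).decode("utf-8", errors="strict")


def is_safe_path(escaped: str) -> bool:
    """Check for '..' segments only after decoding has succeeded,
    so that the check sees what later processing will see."""
    try:
        path = decode_path_strict(escaped)
    except UnicodeDecodeError:
        # Overlong or otherwise invalid UTF-8: refuse outright.
        return False
    return ".." not in path.split("/")
```

With this order of operations, '%C0%AF..' is refused at the decoding step rather than passing a test on the escaped form and being interpreted as '/..' later.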
There are various ways in which "spoofing" can occur with IRIs. There are various ways in which "spoofing" can occur with IRIs.
"Spoofing" means that somebody may add a resource name that looks the "Spoofing" means that somebody may add a resource name that looks the
same or similar to the user, but points to a different resource. The same or similar to the user, but points to a different resource. The
added resource may pretend to be the real resource by looking very added resource may pretend to be the real resource by looking very
similar, but may contain all kinds of changes that may be difficult similar, but may contain all kinds of changes that may be difficult
to spot and can cause all kinds of problems. Most spoofing to spot and can cause all kinds of problems. Most spoofing
possibilities for IRIs are extensions of those for URIs. possibilities for IRIs are extensions of those for URIs.
Spoofing can occur for various reasons. A first reason is that Spoofing can occur for various reasons. A first reason is that
normalization expectations of a user or actual normalization when normalization expectations of a user or actual normalization when
entering an IRI, or when transcoding an IRI from a legacy character entering an IRI, or when transcoding an IRI from a legacy character
encoding, do not match the normalization used on the server side. encoding, do not match the normalization used on the server side.
Conceptually, this is no different from the problems surrounding the Conceptually, this is no different from the problems surrounding the
use of case-insensitive web servers. For example, a popular web page use of case-insensitive web servers. For example, a popular web page
with a mixed case name (http://big.site/PopularPage.html) might be with a mixed case name (http://big.example.com/PopularPage.html)
"spoofed" by someone who is able to create http://big.site/ might be "spoofed" by someone who is able to create
popularpage.html. However, the use of unnormalized character http://big.example.com/popularpage.html. However, the use of
sequences, and of additional mappings for user convenience, may unnormalized character sequences, and of additional mappings for user
increase the chance for spoofing. Protocols and servers that allow convenience, may increase the chance for spoofing. Protocols and
the creation of resources with unnormalized names, and resources with servers that allow the creation of resources with names that are not
names that are not normalized, are particularly vulnerable to such normalized are particularly vulnerable to such attacks. This is an
attacks. This is an inherent security problem of the relevant inherent security problem of the relevant protocol, server, or
protocol, server, or resource, and not specific to IRIs, but resource, and not specific to IRIs, but mentioned here for
mentioned here for completeness. completeness.
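
As a rough illustration of the normalization mismatch described in this paragraph, the sketch below compares two path names that render identically but differ in their code point sequences. The file names and the helper same_after_nfc are hypothetical and do not come from the draft; the comparison uses Normalization Form C only as one possible server-side choice.

```python
import unicodedata


def same_after_nfc(a: str, b: str) -> bool:
    """Return True if the two strings are equal once both are NFC-normalized."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


# Hypothetical resource names: one precomposed (U+00E9), one decomposed
# (e followed by U+0301 COMBINING ACUTE ACCENT).
precomposed = "r\u00e9sum\u00e9.html"
decomposed = "re\u0301sume\u0301.html"

# A server that compares raw code points treats these as two different
# resources, which is the opening for the spoofing described above.
raw_equal = precomposed == decomposed

# A server that normalizes before comparison treats them as one resource.
normalized_equal = same_after_nfc(precomposed, decomposed)
```

A server that stores names unnormalized would let both resources coexist, while a client that normalizes on input could only ever reach one of them.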
Spoofing can occur in various IRI components, such as the domain name Spoofing can occur in various IRI components, such as the domain name
part or a path part. For considerations specific to the domain name part or a path part. For considerations specific to the domain name
part, see [RFC3491]. For the path part, administrators of sites which part, see [RFC3491]. For the path part, administrators of sites
allow independent users to create resources in the same subarea may which allow independent users to create resources in the same subarea
need to be careful to check for spoofing. may need to be careful to check for spoofing.
Spoofing can occur because in the UCS, there are many characters that Spoofing can occur because in the UCS, there are many characters that
look very similar. Details are discussed in Section 7.5. Again, this look very similar. Details are discussed in Section 7.5. Again,
is very similar to spoofing possibilities on US-ASCII, e.g. using this is very similar to spoofing possibilities on US-ASCII, e.g.
'br0ken' or '1ame' URIs. using 'br0ken' or '1ame' URIs.
Spoofing can occur when URIs with percent-encodings based on various Spoofing can occur when URIs with percent-encodings based on various
character encodings are accepted to deal with older user agents. In character encodings are accepted to deal with older user agents. In
some cases, in particular for Latin-based resource names, this is some cases, in particular for Latin-based resource names, this is
usually easy to detect because UTF-8-encoded names, when interpreted usually easy to detect because UTF-8-encoded names, when interpreted
and viewed as legacy character encodings, produce mostly garbage. In and viewed as legacy character encodings, produce mostly garbage. In
other cases, when concurrently used character encodings have a other cases, when concurrently used character encodings have a
similar structure, but there are no characters that have exactly the similar structure, but there are no characters that have exactly the
same encoding, detection is more difficult. same encoding, detection is more difficult.
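
The "mostly garbage" effect for Latin-based names can be seen in a small sketch. The example name is an assumption chosen for illustration; the calls are from Python's standard urllib.parse module.

```python
from urllib.parse import quote, unquote

# Hypothetical Latin-based resource name containing U+00E9.
name = "r\u00e9sum\u00e9"

# Percent-encode the UTF-8 octets of the name.
encoded = quote(name, safe="")

# Interpret the same octets as iso-8859-1, as a legacy agent might.
misread = unquote(encoded, encoding="iso-8859-1")

# Interpret them as UTF-8, recovering the original name.
reread = unquote(encoded, encoding="utf-8")
```

Each two-octet UTF-8 sequence turns into two unrelated Latin-1 characters, which is why the mismatch is usually easy to notice for such names.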
Spoofing can occur with bidirectional IRIs, if the restrictions in Spoofing can occur with bidirectional IRIs, if the restrictions in
Section 4.2 are not followed. The same visual representation may be Section 4.2 are not followed. The same visual representation may be
interpreted as different logical representations, and vice versa. It interpreted as different logical representations, and vice versa. It
is also very important that a correct Unicode bidirectional is also very important that a correct Unicode bidirectional
implementation is used. implementation is used.
9. Acknowledgements 9. IANA Considerations
This document has no actions for IANA.
10. Acknowledgements
We would like to thank Larry Masinter for his work as coauthor of We would like to thank Larry Masinter for his work as coauthor of
many earlier versions of this document (draft-masinter-url-i18n-xx). many earlier versions of this document (draft-masinter-url-i18n-xx).
The discussion on the issue addressed here has started a long time The discussion on the issue addressed here has started a long time
ago. There was a thread in the HTML working group in August 1995 ago. There was a thread in the HTML working group in August 1995
(under the topic of "Globalizing URIs") and in the www-international (under the topic of "Globalizing URIs") and in the www-international
mailing list in July 1996 (under the topic of "Internationalization mailing list in July 1996 (under the topic of "Internationalization
and URLs"), and ad-hoc meetings at the Unicode conferences in and URLs"), and ad-hoc meetings at the Unicode conferences in
September 1995 and September 1997. September 1995 and September 1997.
Many thanks go to Francois Yergeau, Matitiahu Allouche, Roy Fielding, Many thanks go to Francois Yergeau, Matitiahu Allouche, Roy Fielding,
Tim Berners-Lee, Mark Davis, M.T. Carrasco Benitez, James Clark, Tim Tim Berners-Lee, Mark Davis, M.T. Carrasco Benitez, James Clark, Tim
Bray, Chris Wendt, Yaron Goland, Andrea Vine, Misha Wolf, Leslie Bray, Chris Wendt, Yaron Goland, Andrea Vine, Misha Wolf, Leslie
Daigle, Ted Hardie, Makoto MURATA, Steven Atkin, Ryan Stansifer, Tex Daigle, Ted Hardie, Makoto MURATA, Steven Atkin, Ryan Stansifer, Tex
Texin, Graham Klyne, Bjoern Hoehrmann, Chris Lilley, Ian Jacobs, Adam Texin, Graham Klyne, Bjoern Hoehrmann, Chris Lilley, Ian Jacobs, Adam
Costello, Dan Oscarson, Elliotte Rusty Harold, Mike J. Brown, Andrea Costello, Dan Oscarson, Elliotte Rusty Harold, Mike J. Brown, Roy
Vine, Roy Badami, Jonathan Rosenne, Asmus Freytag, Simon Josefsson, Badami, Jonathan Rosenne, Asmus Freytag, Simon Josefsson, Carlos
Carlos Viegas Damasio, Chris Haynes, Walter Underwood, and many Viegas Damasio, Chris Haynes, Walter Underwood, and many others for
others for help with understanding the issues and possible solutions, help with understanding the issues and possible solutions, and
and getting the details right. Thanks also to the members of the W3C getting the details right.
I18N Working Group and Interest Group for their contributions and
their work on [CharMod], to the members of many other W3C WGs for
adopting IRIs, and to the members of the Montreal IAB Workshop on
Internationalization and Localization for their review.
10. References This document is a product of the Internationalization Working Group
(I18N WG) of the World Wide Web Consortium (W3C). Thanks to the
members of the W3C I18N Working Group and Interest Group for their
contributions and their work on [CharMod]. Thanks also go to the
members of many other W3C Working Groups for adopting IRIs, and to
the members of the Montreal IAB Workshop on Internationalization and
Localization for their review.
10.1 Normative References 11. References
11.1 Normative References
[ASCII] American National Standards Institute, "Coded Character
Set -- 7-bit American Standard Code for Information
Interchange", ANSI X3.4, 1986.
[ISO10646] [ISO10646]
International Organization for Standardization, International Organization for Standardization, "ISO/IEC
"Information Technology - Universal Multiple-Octet Coded 10646:2003: Information Technology - Universal
Character Set (UCS) - Part 1: Architecture and Basic Multiple-Octet Coded Character Set (UCS)", ISO Standard
Multilingual Plane - Part 2: Supplementary Planes", ISO 10646, December 2003.
Standard 10646, with amendment, July 2002.
[RFC2119] Bradner, S., "Key words for use in RFCs to Indicate
Requirement Levels", BCP 14, RFC 2119, March 1997.
[RFC2234] Crocker, D. and P. Overell, "Augmented BNF for Syntax [RFC2234] Crocker, D. and P. Overell, "Augmented BNF for Syntax
Specifications: ABNF", RFC 2234, November 1997. Specifications: ABNF", RFC 2234, November 1997.
[RFC3490] Faltstrom, P., Hoffman, P. and A. Costello, [RFC3490] Faltstrom, P., Hoffman, P. and A. Costello,
"Internationalizing Domain Names in Applications (IDNA)", "Internationalizing Domain Names in Applications (IDNA)",
RFC 3490, March 2003, <http://www.ietf.org/rfc/ RFC 3490, March 2003.
rfc3490.txt>.
[RFC3491] Hoffman, P. and M. Blanchet, "Nameprep: A Stringprep [RFC3491] Hoffman, P. and M. Blanchet, "Nameprep: A Stringprep
Profile for Internationalized Domain Names (IDN)", RFC Profile for Internationalized Domain Names (IDN)", RFC
3491, March 2003. 3491, March 2003.
[RFC3629] Yergeau, F., "UTF-8, a transformation format of ISO [RFC3629] Yergeau, F., "UTF-8, a transformation format of ISO
10646", STD 63, RFC 3629, November 2003, <http:// 10646", STD 63, RFC 3629, November 2003.
www.ietf.org/rfc/rfc3629.txt>.
[RFCYYYY] Berners-Lee, T., Fielding, R. and L. Masinter, "Uniform [RFCYYYY] Berners-Lee, T., Fielding, R. and L. Masinter, "Uniform
Resource Identifier (URI): Generic Syntax", Resource Identifier (URI): Generic Syntax (Note to the RFC
draft-fielding-uri-rfc2396bis-03.txt (work in progress), Editor: Please update this reference with the RFC
June 2003. resulting from draft-fielding-uri-rfc2396bis-xx.txt, and
remove this Note)", draft-fielding-uri-rfc2396bis-05.txt
(work in progress), April 2004.
[UNI9] Davis, M., "The Bidirectional Algorithm", Unicode Standard
Annex #9, March 2004,
<http://www.unicode.org/reports/tr9/tr9-13.html>.
[UNIV4] The Unicode Consortium, "The Unicode Standard, Version
4.0.1, defined by: The Unicode Standard, Version 4.0
(Reading, MA, Addison-Wesley, 2003. ISBN 0-321-18578-1),
as amended by Unicode 4.0.1
(http://www.unicode.org/versions/Unicode4.0.1/)", March
2004.
[UTR15] Davis, M. and M. Duerst, "Unicode Normalization Forms", [UTR15] Davis, M. and M. Duerst, "Unicode Normalization Forms",
Unicode Standard Annex #15, March 2001, <http:// Unicode Standard Annex #15, April 2003,
www.unicode.org/unicode/reports/tr15/tr15-21.html>. <http://www.unicode.org/unicode/reports/tr15/tr15-23.html>.
10.2 Non-normative References 11.2 Non-normative References
[BidiEx] "Examples of bidirectional IRIs", <http://www.w3.org/ [BidiEx] "Examples of bidirectional IRIs",
International/iri-edit/BidiExamples>. <http://www.w3.org/International/iri-edit/BidiExamples>.
[CharMod] Duerst, M., Yergeau, F., Ishida, R., Wolf, M. and T. [CharMod] Duerst, M., Yergeau, F., Ishida, R., Wolf, M. and T.
Texin, "Character Model for the World Wide Web", World Texin, "Character Model for the World Wide Web", World
Wide Web Consortium Working Draft, August 2003, <http:// Wide Web Consortium Working Draft, February 2004, <http://
www.w3.org/TR/charmod>. www.w3.org/TR/charmod>.
[Duerst01] [Duerst01]
Duerst, M., "Internationalized Resource Identifiers: From Duerst, M., "Internationalized Resource Identifiers: From
Specification to Testing", Proc. 19th International Specification to Testing", Proc. 19th International
Unicode Conference, San Jose , September 2001, <http:// Unicode Conference, San Jose , September 2001,
www.w3.org/2001/Talks/0912-IUC-IRI/paper.html>. <http://www.w3.org/2001/Talks/0912-IUC-IRI/paper.html>.
[Duerst97] [Duerst97]
Duerst, M., "The Properties and Promises of UTF-8", Proc. Duerst, M., "The Properties and Promises of UTF-8", Proc.
11th International Unicode Conference, San Jose , 11th International Unicode Conference, San Jose ,
September 1997, <http://www.ifi.unizh.ch/mml/mduerst/ September 1997,
papers/PDF/IUC11-UTF-8.pdf>. <http://www.ifi.unizh.ch/mml/mduerst/papers/PDF/IUC11-UTF-8.pdf>
.
[Gettys] Gettys, J., "URI Model Consequences", <http://www.w3.org/ [Gettys] Gettys, J., "URI Model Consequences",
DesignIssues/ModelConsequences>. <http://www.w3.org/DesignIssues/ModelConsequences>.
[HTML4] Raggett, D., Le Hors, A. and I. Jacobs, "HTML 4.01 [HTML4] Raggett, D., Le Hors, A. and I. Jacobs, "HTML 4.01
Specification", World Wide Web Consortium Recommendation, Specification", World Wide Web Consortium Recommendation,
December 1999, <http://www.w3.org/TR/REC-html40/appendix/ December 1999,
notes.html#h-B.2>. <http://www.w3.org/TR/REC-html40/appendix/notes.html#h-B.2>
.
[RFC2119] Bradner, S., "Key words for use in RFCs to Indicate
Requirement Levels", BCP 14, RFC 2119, March 1997.
[RFC2130] Weider, C., Preston, C., Simonsen, K., Alvestrand, H., [RFC2130] Weider, C., Preston, C., Simonsen, K., Alvestrand, H.,
Atkinson, R., Crispin, M. and P. Svanberg, "The Report of Atkinson, R., Crispin, M. and P. Svanberg, "The Report of
the IAB Character Set Workshop held 29 February - 1 March, the IAB Character Set Workshop held 29 February - 1 March,
1996", RFC 2130, April 1997. 1996", RFC 2130, April 1997.
[RFC2141] Moats, R., "URN Syntax", RFC 2141, May 1997. [RFC2141] Moats, R., "URN Syntax", RFC 2141, May 1997.
[RFC2192] Newman, C., "IMAP URL Scheme", RFC 2192, September 1997. [RFC2192] Newman, C., "IMAP URL Scheme", RFC 2192, September 1997.
skipping to change at page 35, line 41 skipping to change at page 37, line 16
[RFC2616] Fielding, R., Gettys, J., Mogul, J., Nielsen, H., [RFC2616] Fielding, R., Gettys, J., Mogul, J., Nielsen, H.,
Masinter, L., Leach, P. and T. Berners-Lee, "Hypertext Masinter, L., Leach, P. and T. Berners-Lee, "Hypertext
Transfer Protocol -- HTTP/1.1", RFC 2616, June 1999. Transfer Protocol -- HTTP/1.1", RFC 2616, June 1999.
[RFC2640] Curtin, B., "Internationalization of the File Transfer [RFC2640] Curtin, B., "Internationalization of the File Transfer
Protocol", RFC 2640, July 1999. Protocol", RFC 2640, July 1999.
[RFC2718] Masinter, L., Alvestrand, H., Zigmond, D. and R. Petke, [RFC2718] Masinter, L., Alvestrand, H., Zigmond, D. and R. Petke,
"Guidelines for new URL Schemes", RFC 2718, November 1999. "Guidelines for new URL Schemes", RFC 2718, November 1999.
[UNI9] Davis, M., "The Bidirectional Algorithm", Unicode Standard
Annex #9, March 2002, <http://www.unicode.org/unicode/
reports/tr9>.
[UNIV4] The Unicode Consortium, "The Unicode Standard, Version
4.0", Addison-Wesley, Reading, MA , 2003.
[UNIXML] Duerst, M. and A. Freytag, "Unicode in XML and other [UNIXML] Duerst, M. and A. Freytag, "Unicode in XML and other
Markup Languages", Unicode Technical Report #20, World Markup Languages", Unicode Technical Report #20, World
Wide Web Consortium Note, February 2002, <http:// Wide Web Consortium Note, February 2002,
www.w3.org/TR/unicode-xml/>. <http://www.w3.org/TR/unicode-xml/>.
[W3CIRI] Duerst, M., "Internationalization - URIs and other [W3CIRI] Duerst, M., "Internationalization - URIs and other
identifiers", World Wide Web Consortium Note, September identifiers", September 2002,
2002, <http://www.w3.org/International/ <http://www.w3.org/International/O-URL-and-ident.html>.
O-URL-and-ident.html>.
[XLink] DeRose, S., Maler, E. and D. Orchard, "XML Linking [XLink] DeRose, S., Maler, E. and D. Orchard, "XML Linking
Language (XLink) Version 1.0", World Wide Web Consortium Language (XLink) Version 1.0", World Wide Web Consortium
Recommendation, June 2001, <http://www.w3.org/TR/xlink/ Recommendation, June 2001,
#link-locators>. <http://www.w3.org/TR/xlink/#link-locators>.
[XML1] Bray, T., Paoli, J., Sperberg-McQueen, C. and E. Maler, [XML1] Bray, T., Paoli, J., Sperberg-McQueen, C., Maler, E. and
"Extensible Markup Language (XML) 1.0 (Second Edition)", F. Yergeau, "Extensible Markup Language (XML) 1.0 (Third
World Wide Web Consortium Recommendation, including Edition)", World Wide Web Consortium Recommendation,
Erratum 26 at http://www.w3.org/XML/xml-V10-2e-errata#E26, February 2004,
October 2000, <http://www.w3.org/TR/ <http://www.w3.org/TR/REC-xml#sec-external-ent>.
REC-xml#sec-external-ent>.
[XMLNamespace] [XMLNamespace]
Bray, T., Hollander, D. and A. Layman, "Namespaces in Bray, T., Hollander, D. and A. Layman, "Namespaces in
XML", World Wide Web Consortium Recommendation, January XML", World Wide Web Consortium Recommendation, January
1999, <http://www.w3.org/TR/REC-xml#sec-external-ent>. 1999, <http://www.w3.org/TR/REC-xml-names>.
[XMLSchema] [XMLSchema]
Biron, P. and A. Malhotra, "XML Schema Part 2: Datatypes", Biron, P. and A. Malhotra, "XML Schema Part 2: Datatypes",
World Wide Web Consortium Recommendation, May 2001, World Wide Web Consortium Recommendation, May 2001,
<http://www.w3.org/TR/xmlschema-2/#anyURI>. <http://www.w3.org/TR/xmlschema-2/#anyURI>.
[XPointer] [XPointer]
Grosso, P., Maler, E., Marsh, J. and N. Walsh, "XPointer Grosso, P., Maler, E., Marsh, J. and N. Walsh, "XPointer
Framework", World Wide Web Consortium Recommendation, Framework", World Wide Web Consortium Recommendation,
March 2003, <http://www.w3.org/TR/xptr-framework/ March 2003,
#escaping>. <http://www.w3.org/TR/xptr-framework/#escaping>.
Authors' Addresses Authors' Addresses
Martin Duerst (Note: Please write "Duerst" with u-umlaut wherever Martin Duerst (Note: Please write "Duerst" with u-umlaut wherever
possible, for example as "D&#252;rst" in XML and HTML.) possible, for example as "D&#252;rst" in XML and HTML.)
World Wide Web Consortium World Wide Web Consortium
5322 Endo 5322 Endo
Fujisawa, Kanagawa 252-8520 Fujisawa, Kanagawa 252-8520
Japan Japan
skipping to change at page 37, line 30 skipping to change at page 38, line 46
Appendix A.1 New Scheme(s) Appendix A.1 New Scheme(s)
Introducing new schemes (for example httpi:, ftpi:,...) or a new Introducing new schemes (for example httpi:, ftpi:,...) or a new
metascheme (e.g. i:, leading to URI/IRI prefixes such as i:http:, metascheme (e.g. i:, leading to URI/IRI prefixes such as i:http:,
i:ftp:,...) was proposed to make IRI-to-URI conversion i:ftp:,...) was proposed to make IRI-to-URI conversion
scheme-dependent or to distinguish between percent-encodings scheme-dependent or to distinguish between percent-encodings
resulting from IRI-to-URI conversion and percent-encodings from resulting from IRI-to-URI conversion and percent-encodings from
legacy character encodings. legacy character encodings.
New schemes are not needed to distinguish URIs from true IRIs (i.e. New schemes are not needed to distinguish URIs from true IRIs (i.e.
IRIs that contain non-ASCII characters). The benefit of being able to IRIs that contain non-ASCII characters). The benefit of being able
detect the origin of percent-encodings is marginal, also because to detect the origin of percent-encodings is marginal, because UTF-8
UTF-8 can be detected with very high reliably. Deploying new schemes can be detected with very high reliability. Deploying new schemes is
is extremely hard. Not needing new schemes for IRIs makes deployment extremely hard, so not requiring new schemes for IRIs makes
of IRIs vastly easier. Making conversion scheme-dependent is highly deployment of IRIs vastly easier. Making conversion scheme-dependent
unadvisable. Using an uniform convention for conversion from IRIs to is highly inadvisable, and would be encouraged by separate schemes
URIs makes IRI implementation orthogonal from the introduction of for IRIs. Using an uniform convention for conversion from IRIs to
acual new schemes. URIs makes IRI implementation orthogonal to the introduction of
actual new schemes.
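
The claim that UTF-8 can be detected reliably can be sketched as a strict decode of the percent-decoded octets. This is a minimal check under stated assumptions, not a procedure defined by the draft: the function name looks_like_utf8 is hypothetical, and pure US-ASCII input also passes because US-ASCII is a subset of UTF-8.

```python
from urllib.parse import unquote_to_bytes


def looks_like_utf8(percent_encoded: str) -> bool:
    """Return True if the percent-decoded octets form a valid UTF-8 sequence."""
    raw = unquote_to_bytes(percent_encoded)
    try:
        raw.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


# %C3%A9 is the UTF-8 encoding of U+00E9; a lone %E9 is the iso-8859-1
# octet for the same character and is not valid UTF-8 on its own.
utf8_result = looks_like_utf8("ros%C3%A9")
legacy_result = looks_like_utf8("ros%E9")
```

Octet sequences produced by legacy encodings rarely satisfy the UTF-8 lead-byte and continuation-byte pattern by accident, which is the basis for the reliability claim.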
Appendix A.2 Other Character Encodings than UTF-8 Appendix A.2 Other Character Encodings than UTF-8
At an early stage, UTF-7 was considered as an alternative to UTF-8 At an early stage, UTF-7 was considered as an alternative to UTF-8
when converting IRIs to URIs. UTF-7 would not have needed when converting IRIs to URIs. UTF-7 would not have needed
percent-encoding, and would in most cases have been shorter than percent-encoding, and would in most cases have been shorter than
percent-encoded UTF-8. percent-encoded UTF-8.
UTF-8 avoids a double layering and overloading of the use of the "+" Using UTF-8 avoids a double layering and overloading of the use of
character. UTF-8 is fully compatible with US-ASCII, and has therefore the "+" character. UTF-8 is fully compatible with US-ASCII, and has
been recommended by the IETF, and is being used widely, while UTF-7 therefore been recommended by the IETF, and is being used widely,
has never been used much and is now clearly being discouraged. while UTF-7 has never been used much and is now clearly being
discouraged. Requiring implementations to convert from UTF-8 to
UTF-7 and back would be an additional implementation burden.
Appendix A.3 New Encoding Convention Appendix A.3 New Encoding Convention
Instead of using the existing percent-encoding convention of URIs, Instead of using the existing percent-encoding convention of URIs,
which is based on octets, the idea was to create a new encoding which is based on octets, the idea was to create a new encoding
convention, for example to use '%u' to introduce UCS code points. convention, for example to use '%u' to introduce UCS code points.
Using the existing octet-based percent-encoding mechanism does not Using the existing octet-based percent-encoding mechanism does not
need an upgrade of the URI syntax, and does not need corresponding need an upgrade of the URI syntax, and does not need corresponding
server upgrades. server upgrades.
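
For comparison with the '%u' proposal, the existing octet-based convention can be sketched as follows. This is a simplified illustration, not the draft's full IRI-to-URI mapping: it ignores the separate handling of the domain name part, and the function name percent_encode_non_ascii is hypothetical.

```python
def percent_encode_non_ascii(iri: str) -> str:
    """Percent-encode the UTF-8 octets of every non-ASCII character.

    US-ASCII characters, including existing percent-encodings, pass through
    unchanged, so the result uses only syntax that URIs already allow.
    """
    out = []
    for ch in iri:
        if ord(ch) < 0x80:
            out.append(ch)
        else:
            # One %HH triplet per UTF-8 octet of the character.
            out.extend(f"%{octet:02X}" for octet in ch.encode("utf-8"))
    return "".join(out)


converted = percent_encode_non_ascii("http://www.example.org/ros\u00e9")
```

The output contains only '%' followed by two hexadecimal digits, so existing URI parsers and servers can process it without changes, whereas a '%u' form would require a new syntax rule.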
Appendix A.4 Indicating Character Encodings in the URI/IRI Appendix A.4 Indicating Character Encodings in the URI/IRI
Some proposals suggested indicating the character encodings used in Some proposals suggested indicating the character encodings used in
an URI or IRI with some new syntactic convention in the URI itself, an URI or IRI with some new syntactic convention in the URI itself,
similar to the 'charset' parameter for emails and Web pages. As an similar to the 'charset' parameter for emails and Web pages. As an
example, the label in square brackets in http://www.example.org/ example, the label in square brackets in
ros[iso-8859-1]&#xE9; indicated that the following &#xE9; had to be http://www.example.org/ros[iso-8859-1]&#xE9; indicated that the
interpreted as iso-8859-1. following &#xE9; had to be interpreted as iso-8859-1.
Using UTF-8 only does not need an upgrade to the URI syntax. It Using UTF-8 only does not need an upgrade to the URI syntax. It
avoids potentially multiple labels that have to be copied correctly avoids potentially multiple labels that have to be copied correctly
in all cases, even on the side of a bus or on a napkin, leading to in all cases, even on the side of a bus or on a napkin, leading to
usability problems to the extent of being prohibitively annoying. usability problems to the extent of being prohibitively annoying.
Using UTF-8 only also reduces transcoding errors and confusions. Using UTF-8 only also reduces transcoding errors and confusions.
Intellectual Property Statement Intellectual Property Statement
The IETF takes no position regarding the validity or scope of any The IETF takes no position regarding the validity or scope of any
Intellectual Property Rights or other rights that might be claimed to Intellectual Property Rights or other rights that might be claimed to
pertain to the implementation or use of the technology described in pertain to the implementation or use of the technology described in
this document or the extent to which any license under such rights this document or the extent to which any license under such rights
might or might not be available; nor does it represent that it has might or might not be available; nor does it represent that it has
made any independent effort to identify any such rights. Information made any independent effort to identify any such rights. Information
on the IETF's procedures with respect to rights in IETF Documents can on the procedures with respect to rights in RFC documents can be
be found in BCP 78 and BCP 79. found in BCP 78 and BCP 79.
Copies of IPR disclosures made to the IETF Secretariat and any Copies of IPR disclosures made to the IETF Secretariat and any
assurances of licenses to be made available, or the result of an assurances of licenses to be made available, or the result of an
attempt made to obtain a general license or permission for the use of attempt made to obtain a general license or permission for the use of
such proprietary rights by implementers or users of this such proprietary rights by implementers or users of this
specification can be obtained from the IETF on-line IPR repository at specification can be obtained from the IETF on-line IPR repository at
http://www.ietf.org/ipr. http://www.ietf.org/ipr.
The IETF invites any interested party to bring to its attention any The IETF invites any interested party to bring to its attention any
copyrights, patents or patent applications, or other proprietary copyrights, patents or patent applications, or other proprietary
 End of changes. 
