draft-duerst-iri-02.txt   draft-duerst-iri-03.txt 

Network Working Group M. Duerst Network Working Group M. Duerst
Internet-Draft W3C Internet-Draft W3C
Expires: May 4, 2003 M. Suignard Expires: August 31, 2003 M. Suignard
Microsoft Corporation Microsoft Corporation
November 3, 2002 March 2, 2003
Internationalized Resource Identifiers (IRIs) Internationalized Resource Identifiers (IRIs)
draft-duerst-iri-02 draft-duerst-iri-03
Status of this Memo Status of this Memo
This document is an Internet-Draft and is in full conformance with This document is an Internet-Draft and is in full conformance with
all provisions of Section 10 of RFC2026. all provisions of Section 10 of RFC2026.
Internet-Drafts are working documents of the Internet Engineering Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF), its areas, and its working groups. Note that Task Force (IETF), its areas, and its working groups. Note that
other groups may also distribute working documents as Internet- other groups may also distribute working documents as Internet-
Drafts. Drafts.
skipping to change at page 1, line 34 skipping to change at page 1, line 33
and may be updated, replaced, or obsoleted by other documents at any and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress." material or to cite them other than as "work in progress."
The list of current Internet-Drafts can be accessed at http:// The list of current Internet-Drafts can be accessed at http://
www.ietf.org/ietf/1id-abstracts.txt. www.ietf.org/ietf/1id-abstracts.txt.
The list of Internet-Draft Shadow Directories can be accessed at The list of Internet-Draft Shadow Directories can be accessed at
http://www.ietf.org/shadow.html. http://www.ietf.org/shadow.html.
This Internet-Draft will expire on May 4, 2003. This Internet-Draft will expire on August 31, 2003.
Copyright Notice Copyright Notice
Copyright (C) The Internet Society (2002). All Rights Reserved. Copyright (C) The Internet Society (2003). All Rights Reserved.
Abstract Abstract
This document defines a new protocol element, the Internationalized This document defines a new protocol element, the Internationalized
Resource Identifier (IRI), as a complement to the URI [RFC2396]. An Resource Identifier (IRI), as a complement to the URI [RFC2396]. An
IRI is a sequence of characters from the Universal Character Set IRI is a sequence of characters from the Universal Character Set
[ISO10646]. A mapping from IRIs to URIs is defined, which means that [ISO10646]. A mapping from IRIs to URIs is defined, which means that
IRIs can be used instead of URIs where appropriate to identify IRIs can be used instead of URIs where appropriate to identify
resources. resources.
skipping to change at page 2, line 16 skipping to change at page 2, line 16
formats, and software components that now deal with URIs are formats, and software components that now deal with URIs are
provided. provided.
NOTE NOTE
This document is a product of the Internationalization Working Group This document is a product of the Internationalization Working Group
(I18N WG) of the World Wide Web Consortium (W3C). For general (I18N WG) of the World Wide Web Consortium (W3C). For general
discussion, please use the www-international@w3.org mailing list discussion, please use the www-international@w3.org mailing list
(publicly archived at http://lists.w3.org/Archives/Public/www- (publicly archived at http://lists.w3.org/Archives/Public/www-
international/). For more information on the topic of this document, international/). For more information on the topic of this document,
please also see [W3CIRI] and [Duer01]. please also see [W3CIRI] and [Duerst01].
Table of Contents Table of Contents
1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 4 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 4
1.1 Overview and Motivation . . . . . . . . . . . . . . . . . . . 4 1.1 Overview and Motivation . . . . . . . . . . . . . . . . . . 4
1.2 Applicability . . . . . . . . . . . . . . . . . . . . . . . . 4 1.2 Applicability . . . . . . . . . . . . . . . . . . . . . . . 4
1.3 Definitions . . . . . . . . . . . . . . . . . . . . . . . . . 5 1.3 Definitions . . . . . . . . . . . . . . . . . . . . . . . . 5
1.4 Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 1.4 Notation . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2. IRI Syntax . . . . . . . . . . . . . . . . . . . . . . . . . . 6 2. IRI Syntax . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.1 Summary of IRI Syntax . . . . . . . . . . . . . . . . . . . . 7 2.1 Summary of IRI Syntax . . . . . . . . . . . . . . . . . . . 7
2.2 ABNF for IRI References and IRIs . . . . . . . . . . . . . . . 7 2.2 ABNF for IRI References and IRIs . . . . . . . . . . . . . . 7
2.3 IRI Equivalence and Normalization . . . . . . . . . . . . . . 10 2.3 IRI Equivalence and Normalization . . . . . . . . . . . . . 10
3. Relationship between IRIs and URIs . . . . . . . . . . . . . . 12 3. Relationship between IRIs and URIs . . . . . . . . . . . . . 11
3.1 Mapping of IRIs to URIs . . . . . . . . . . . . . . . . . . . 12 3.1 Mapping of IRIs to URIs . . . . . . . . . . . . . . . . . . 12
3.2 Converting URIs to IRIs . . . . . . . . . . . . . . . . . . . 14 3.2 Converting URIs to IRIs . . . . . . . . . . . . . . . . . . 14
4. Bidirectional IRIs for Right-to-left Languages . . . . . . . . 15 3.2.1 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . 15
4.1 Logical Storage and Visual Presentation . . . . . . . . . . . 15 4. Bidirectional IRIs for Right-to-left Languages . . . . . . . 16
4.2 Bidi IRI Structure . . . . . . . . . . . . . . . . . . . . . . 16 4.1 Logical Storage and Visual Presentation . . . . . . . . . . 17
4.3 Input of Bidi IRIs . . . . . . . . . . . . . . . . . . . . . . 17 4.2 Bidi IRI Structure . . . . . . . . . . . . . . . . . . . . . 17
4.4 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 4.3 Input of Bidi IRIs . . . . . . . . . . . . . . . . . . . . . 18
5. Use of IRIs . . . . . . . . . . . . . . . . . . . . . . . . . 19 4.4 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . 18
5.1 Limitations on UCS Characters Allowed in IRIs . . . . . . . . 19 5. Use of IRIs . . . . . . . . . . . . . . . . . . . . . . . . 20
5.2 Software Interfaces and Protocols . . . . . . . . . . . . . . 20 5.1 Limitations on UCS Characters Allowed in IRIs . . . . . . . 20
5.3 Format of URIs and IRIs in Documents and Protocols . . . . . . 20 5.2 Software Interfaces and Protocols . . . . . . . . . . . . . 21
5.4 Relative IRI References . . . . . . . . . . . . . . . . . . . 21 5.3 Format of URIs and IRIs in Documents and Protocols . . . . . 21
6. URI/IRI Processing Guidelines (informative) . . . . . . . . . 21 5.4 Relative IRI References . . . . . . . . . . . . . . . . . . 22
6.1 URI/IRI Software Interfaces . . . . . . . . . . . . . . . . . 21 6. URI/IRI Processing Guidelines (informative) . . . . . . . . 22
6.2 URI/IRI Entry . . . . . . . . . . . . . . . . . . . . . . . . 21 6.1 URI/IRI Software Interfaces . . . . . . . . . . . . . . . . 22
6.3 URI/IRI Transfer Between Applications . . . . . . . . . . . . 22 6.2 URI/IRI Entry . . . . . . . . . . . . . . . . . . . . . . . 23
6.4 URI/IRI Generation . . . . . . . . . . . . . . . . . . . . . . 23 6.3 URI/IRI Transfer Between Applications . . . . . . . . . . . 23
6.5 URI/IRI Selection . . . . . . . . . . . . . . . . . . . . . . 23 6.4 URI/IRI Generation . . . . . . . . . . . . . . . . . . . . . 24
6.6 Display of URIs/IRIs . . . . . . . . . . . . . . . . . . . . . 24 6.5 URI/IRI Selection . . . . . . . . . . . . . . . . . . . . . 24
6.7 Interpretation of URIs and IRIs . . . . . . . . . . . . . . . 24 6.6 Display of URIs/IRIs . . . . . . . . . . . . . . . . . . . . 25
6.8 Upgrading Strategy . . . . . . . . . . . . . . . . . . . . . . 25 6.7 Interpretation of URIs and IRIs . . . . . . . . . . . . . . 25
7. Security Considerations . . . . . . . . . . . . . . . . . . . 26 6.8 Upgrading Strategy . . . . . . . . . . . . . . . . . . . . . 26
8. Change log . . . . . . . . . . . . . . . . . . . . . . . . . . 27 7. Security Considerations . . . . . . . . . . . . . . . . . . 27
8.1 Changes from -01 to -02 . . . . . . . . . . . . . . . . . . . 27 8. Issues List . . . . . . . . . . . . . . . . . . . . . . . . 28
8.2 Changes from -00 to -01 . . . . . . . . . . . . . . . . . . . 27 9. Change log . . . . . . . . . . . . . . . . . . . . . . . . . 28
9. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . 27 9.1 Changes from -02 to -03 . . . . . . . . . . . . . . . . . . 28
Normative References . . . . . . . . . . . . . . . . . . . . . 28 9.2 Changes from -01 to -02 . . . . . . . . . . . . . . . . . . 29
Non-normative References . . . . . . . . . . . . . . . . . . . 29 9.3 Changes from -00 to -01 . . . . . . . . . . . . . . . . . . 29
Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . 31 10. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . 29
Full Copyright Statement . . . . . . . . . . . . . . . . . . . 32 Normative References . . . . . . . . . . . . . . . . . . . . 30
Non-normative References . . . . . . . . . . . . . . . . . . 31
Authors' Addresses . . . . . . . . . . . . . . . . . . . . . 33
Full Copyright Statement . . . . . . . . . . . . . . . . . . 34
1. Introduction 1. Introduction
1.1 Overview and Motivation 1.1 Overview and Motivation
A URI is defined in [RFC2396] as a sequence of characters chosen from A URI is defined in [RFC2396] as a sequence of characters chosen from
a limited subset of the repertoire of US-ASCII characters. a limited subset of the repertoire of US-ASCII characters.
The characters in URIs are frequently used for representing words of The characters in URIs are frequently used for representing words of
natural languages. Such usage has many advantages: such URIs are natural languages. Such usage has many advantages: such URIs are
skipping to change at page 5, line 26 skipping to change at page 5, line 26
UTF-8. For new URI schemes, this is recommended in [RFC2718]. UTF-8. For new URI schemes, this is recommended in [RFC2718].
This allows IRIs to be used with the URN syntax [RFC2141] as This allows IRIs to be used with the URN syntax [RFC2141] as
well as recent URL scheme definitions based on UTF-8, such as well as recent URL scheme definitions based on UTF-8, such as
IMAP URLs [RFC2192] and POP URLs [RFC2384]. IMAP URLs [RFC2192] and POP URLs [RFC2384].
In cases and for pieces where an encoding other than UTF-8 is used, In cases and for pieces where an encoding other than UTF-8 is used,
and for raw binary data encoded in URIs (see [RFC2397]), the octets and for raw binary data encoded in URIs (see [RFC2397]), the octets
have to be %-escaped. In these situations, the ability of IRIs to have to be %-escaped. In these situations, the ability of IRIs to
directly represent a wide character repertoire cannot be used. directly represent a wide character repertoire cannot be used.
For example, for a document with a URI of http://www.example.org/ For example, for a document with a URI of
r%C3%A9sum%C3%A9.html, it is possible to construct a corresponding http://www.example.org/r%C3%A9sum%C3%A9.html, it is possible to
IRI (in XML notation): http://www.example.org/résumé.html construct a corresponding IRI (in XML notation, see Section 1.4):
(&#xe9; stands for the e-acute character, and %C3%A9 is the UTF-8 encoded http://www.example.org/r&#xE9;sum&#xE9;.html (&#xE9; stands for the
and escaped representation of that character). On the other hand, e-acute character, and %C3%A9 is the UTF-8 encoded and escaped
for a document with a URI of http://www.example.org/r%e9sum%e9.html, representation of that character). On the other hand, for a document
the escaped octets cannot be converted to actual characters in an with a URI of http://www.example.org/r%E9sum%E9.html, the escaped
IRI, because the escaping is based on iso-8859-1 rather than UTF-8. octets cannot be converted to actual characters in an IRI, because
the escaping is based on iso-8859-1 rather than UTF-8.
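The difference between the two example URIs above can be illustrated
with a minimal sketch. It assumes that only runs of escaped octets
with the high bit set are candidates for conversion, and that a run is
converted only if it is strictly valid UTF-8. The function name
unescape_utf8_runs is hypothetical and is not defined by this
document; checks for characters that are not allowed in IRIs are
omitted.

```python
import re

# Runs of %HH escapes whose octet value is 0x80 or above.
# Escapes of US-ASCII octets (e.g. %2F) are deliberately left alone.
_NON_ASCII_RUN = re.compile(r"(?:%[89A-Fa-f][0-9A-Fa-f])+")


def unescape_utf8_runs(uri: str) -> str:
    """Replace escaped octet runs by characters when they are valid UTF-8."""

    def replace(match: re.Match) -> str:
        octets = bytes.fromhex(match.group(0).replace("%", ""))
        try:
            return octets.decode("utf-8")  # strict decoding
        except UnicodeDecodeError:
            # Not UTF-8 (for example iso-8859-1): keep the escapes as they are.
            return match.group(0)

    return _NON_ASCII_RUN.sub(replace, uri)


utf8_based = unescape_utf8_runs("http://www.example.org/r%C3%A9sum%C3%A9.html")
latin1_based = unescape_utf8_runs("http://www.example.org/r%E9sum%E9.html")
```

Under these assumptions, the first URI yields an IRI containing two
e-acute characters, while the second is returned unchanged because
%E9 on its own is not a valid UTF-8 sequence.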
1.3 Definitions 1.3 Definitions
The following definitions are used in this document; they follow the The following definitions are used in this document; they follow the
terms in [RFC2130], [RFC2277] and [ISO10646]: terms in [RFC2130], [RFC2277] and [ISO10646]:
character: A member of a set of elements used for the character: A member of a set of elements used for the
organization, control, or representation of data. For example, organization, control, or representation of data. For example,
"LATIN CAPITAL LETTER A" names a character. "LATIN CAPITAL LETTER A" names a character.
skipping to change at page 6, line 26 skipping to change at page 6, line 26
character encoding. character encoding.
UCS: Universal Character Set; the coded character set defined by UCS: Universal Character Set; the coded character set defined by
[ISO10646] and [UNIV3]. [ISO10646] and [UNIV3].
IRI reference: The term "IRI reference" denotes the common usage IRI reference: The term "IRI reference" denotes the common usage
of an internationalized resource identifier. An IRI reference of an internationalized resource identifier. An IRI reference
may be absolute or relative, and may have additional may be absolute or relative, and may have additional
information attached in the form of a fragment identifier. information attached in the form of a fragment identifier.
However, the "IRI" that results from such a reference only However, the "IRI" that results from such a reference only
includes the absolute IRI after fragment identifier (if any) is includes the absolute IRI after the fragment identifier (if
removed and after any relative IRI is resolved to its absolute any) is removed and after any relative IRI is resolved to its
form. absolute form.
1.4 Notation 1.4 Notation
RFCs and Internet Drafts currently do not allow any characters
outside the US-ASCII repertoire. Therefore, this document uses
various special notations to denote such characters.
In text, characters outside US-ASCII are sometimes referenced by In text, characters outside US-ASCII are sometimes referenced by
using a prefix of 'U+', followed by four to six hexadecimal digits. using a prefix of 'U+', followed by four to six hexadecimal digits.
To represent characters outside US-ASCII in examples, this document To represent characters outside US-ASCII in examples, this document
uses two notations called 'XML Notation' and 'Bidi Notation'. uses two notations called 'XML Notation' and 'Bidi Notation'.
XML Notation uses leading '&#x', trailing ';', and the hexadecimal XML Notation uses leading '&#x', trailing ';', and the hexadecimal
number of the character in the UCS in between. Example: Я stands number of the character in the UCS in between. Example: я
for CYRILLIC CAPITAL LETTER YA. In this notation, an actual '&' is stands for CYRILLIC SMALL LETTER YA. In this notation, an actual
denoted by '&amp;'. '&' is denoted by '&amp;'.
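As an illustration only, XML Notation can be converted to actual
characters with a few lines of code. The sketch below assumes that a
literal ampersand is written in the XML style, with a trailing
semicolon; the function name from_xml_notation is hypothetical.

```python
import re
import unicodedata

_NUMERIC_REF = re.compile(r"&#x([0-9A-Fa-f]+);")


def from_xml_notation(text: str) -> str:
    """Expand '&#x...;' references, then restore literal ampersands."""
    expanded = _NUMERIC_REF.sub(lambda m: chr(int(m.group(1), 16)), text)
    return expanded.replace("&amp;", "&")


# Character names can be looked up to check which character is meant.
name_42f = unicodedata.name(from_xml_notation("&#x42F;"))
name_44f = unicodedata.name(from_xml_notation("&#x44F;"))
```

Looking up the names this way shows that U+042F is the capital letter
YA and U+044F is the small letter YA.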
Bidi Notation is used for bidirectional examples: lower case ASCII Bidi Notation is used for bidirectional examples: lower case ASCII
letters stand for Latin letters or other letters that are written letters stand for Latin letters or other letters that are written
left-to-right, whereas upper case letters represent Arabic or Hebrew left-to-right, whereas upper case letters represent Arabic or Hebrew
letters that are written right-to-left. letters that are written right-to-left.
2. IRI Syntax 2. IRI Syntax
This section defines the syntax of Internationalized Resource This section defines the syntax of Internationalized Resource
Identifiers (IRIs). Identifiers (IRIs).
skipping to change at page 7, line 50 skipping to change at page 8, line 5
because it is in the 'unreserved' category in URIs. because it is in the 'unreserved' category in URIs.
2.2 ABNF for IRI References and IRIs 2.2 ABNF for IRI References and IRIs
While it might be possible to define IRI references and IRIs merely While it might be possible to define IRI references and IRIs merely
by their transformation to URI references and URIs, they can also be by their transformation to URI references and URIs, they can also be
accepted and processed directly. Therefore, an ABNF definition for accepted and processed directly. Therefore, an ABNF definition for
IRI references (which are the most general concept and the start of IRI references (which are the most general concept and the start of
the grammar) and IRIs is given here. The syntax of this ABNF is the grammar) and IRIs is given here. The syntax of this ABNF is
described in [RFC2234]. Character numbers are taken from the UCS, described in [RFC2234]. Character numbers are taken from the UCS,
without implying any actual binary encoding. without implying any actual binary encoding. Terminals in the ABNF
are characters, not bytes.
The following rules are different from [RFC2396]: The following rules are different from [RFC2396]:
absolute-IRI-reference = absolute-IRI [ "#" ifragment ] absolute-IRI-reference = absolute-IRI [ "#" ifragment ]
IRI-reference = [ absolute-IRI / relative-IRI ] IRI-reference = [ absolute-IRI / relative-IRI ]
[ "#" ifragment ] [ "#" ifragment ]
absolute-IRI = scheme ":" ( ihier-part / iopaque-part ) absolute-IRI = scheme ":" ( ihier-part / iopaque-part )
relative-IRI = [ inet-path / iabs-path / irel-path ] relative-IRI = [ inet-path / iabs-path / irel-path ]
[ "?" iquery ] [ "?" iquery ]
skipping to change at page 8, line 40 skipping to change at page 8, line 43
ireg-name = 1*( iunreserved / escaped / ";" / ireg-name = 1*( iunreserved / escaped / ";" /
":" / "@" / "&" / "=" / "+" / "$" / "," ) ":" / "@" / "&" / "=" / "+" / "$" / "," )
iserver = [ [ iuserinfo "@" ] ihostport ] iserver = [ [ iuserinfo "@" ] ihostport ]
iuserinfo = *( iunreserved / escaped / ";" / iuserinfo = *( iunreserved / escaped / ";" /
":" / "&" / "=" / "+" / "$" / "," ) ":" / "&" / "=" / "+" / "$" / "," )
ihostport = ihost [ ":" port ] ihostport = ihost [ ":" port ]
ihost = IPv6reference / IPv4address / ihostname ihost = IPv6reference / IPv4address / ihostname
ihostname = << as specified by [RFCXXXX] >> ihostname = idomainlabel [ iqualified]
iqualified = *( "." idomainlabel ) [ "." itoplabel [ "." ] ]
idomainlabel = <<See following production rules>>
itoplabel = <<See following production rules>>
ipath = [ iabs-path / iopaque-part ] ipath = [ iabs-path / iopaque-part ]
ipath-segments = isegment *( "/" isegment ) ipath-segments = isegment *( "/" isegment )
isegment = *ipchar isegment = *ipchar
ipchar = iunreserved / escaped / ";" / ipchar = iunreserved / escaped / ";" /
":" / "@" / "&" / "=" / "+" / "$" / "," ":" / "@" / "&" / "=" / "+" / "$" / ","
iquery = *( ipchar / "/" / "?" ) iquery = *( ipchar / iprivate / "/" / "?" )
ifragment = *( ipchar / "/" / "?" ) ifragment = *( ipchar / "/" / "?" )
iric = reserved / iunreserved / escaped iric = reserved / iunreserved / escaped
iunreserved = ichar / unreserved iunreserved = unreserved / ucschar / iadditional
ichar = idelims / ucschar / " " / "{" / "}" / "|" iadditional = "<" / ">" / DQUOTE / SP / "{" / "}" /
/ "\" / "^" / "`" "|" / "\" / "^" / "`"
idelims = "<" / ">" / DQUOTE
ucschar = %xA0-D7FF / %xF900-FDCF / %xFDF0-FFEF ucschar = %xA0-D7FF / %xF900-FDCF / %xFDF0-FFEF
/ %x10000-1FFFD / %x20000-2FFFD / %x30000-3FFFD / %x10000-1FFFD / %x20000-2FFFD / %x30000-3FFFD
/ %x40000-4FFFD / %x50000-5FFFD / %x60000-6FFFD / %x40000-4FFFD / %x50000-5FFFD / %x60000-6FFFD
/ %x70000-7FFFD / %x80000-8FFFD / %x90000-9FFFD / %x70000-7FFFD / %x80000-8FFFD / %x90000-9FFFD
/ %xA0000-AFFFD / %xB0000-BFFFD / %xC0000-CFFFD / %xA0000-AFFFD / %xB0000-BFFFD / %xC0000-CFFFD
/ %xD0000-DFFFD / %xE1000-EFFFD / %xD0000-DFFFD / %xE1000-EFFFD
iprivate = %xE000-F8FF / %xF0000-FFFFD / %x100000-10FFFD
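The 'ucschar' and 'iprivate' ranges above can be checked mechanically.
The following is a minimal sketch that copies the ranges from the
productions as written; the names is_ucschar and is_iprivate are
hypothetical helpers, not part of this specification.

```python
# Ranges copied from the 'ucschar' production: three BMP ranges,
# planes 1 through 13 up to xFFFD, and the restricted part of plane 14.
UCSCHAR_RANGES = (
    [(0xA0, 0xD7FF), (0xF900, 0xFDCF), (0xFDF0, 0xFFEF)]
    + [(plane << 16, (plane << 16) | 0xFFFD) for plane in range(0x1, 0xE)]
    + [(0xE1000, 0xEFFFD)]
)

# Ranges copied from the 'iprivate' production.
IPRIVATE_RANGES = [(0xE000, 0xF8FF), (0xF0000, 0xFFFFD), (0x100000, 0x10FFFD)]


def _in_ranges(char: str, ranges) -> bool:
    code_point = ord(char)
    return any(low <= code_point <= high for low, high in ranges)


def is_ucschar(char: str) -> bool:
    """True if the single character matches 'ucschar'."""
    return _in_ranges(char, UCSCHAR_RANGES)


def is_iprivate(char: str) -> bool:
    """True if the single character matches 'iprivate'."""
    return _in_ranges(char, IPRIVATE_RANGES)
```

Note that US-ASCII characters never match 'ucschar'; they are covered
by the other productions.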
The 'idomainlabel' and 'itoplabel' production rules are as follows:
The values 'idomainlabel' and 'itoplabel' are defined as a string of
'ucschar' obeying the following rules:
a) Given a string of 'ucschar' values, the ToASCII operation
[RFCXXXX] is performed on that string with the flag
UseSTD3ASCIIRules set to TRUE and the flag AllowUnassigned set
to FALSE for creating IRIs and set to TRUE otherwise.
b) ToASCII is successful and results in a string conforming to
'domainlabel' for 'idomainlabel' and 'toplabel' for 'itoplabel'
(see below for 'domainlabel' and 'toplabel').
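Rules a) and b) can be approximated as follows. This is a sketch under
several assumptions: it uses the "idna" codec of the Python standard
library as a stand-in for the ToASCII operation, it does not model the
UseSTD3ASCIIRules and AllowUnassigned flags explicitly, and it
enforces the 'domainlabel' and 'toplabel' shapes with regular
expressions written for this example. The name is_idomainlabel is
hypothetical.

```python
import re

# 'domainlabel' and 'toplabel' expressed as regular expressions.
_DOMAINLABEL = re.compile(r"[A-Za-z0-9](?:[A-Za-z0-9-]{0,61}[A-Za-z0-9])?")
_TOPLABEL = re.compile(r"[A-Za-z](?:[A-Za-z0-9-]{0,61}[A-Za-z0-9])?")


def is_idomainlabel(label: str, top: bool = False) -> bool:
    """Check one label: ToASCII must succeed (rule a) and the result
    must conform to 'domainlabel' or 'toplabel' (rule b)."""
    try:
        ascii_label = label.encode("idna").decode("ascii")
    except UnicodeError:
        return False  # the ToASCII stand-in failed
    pattern = _TOPLABEL if top else _DOMAINLABEL
    return pattern.fullmatch(ascii_label) is not None
```

A label such as "b&#xFC;cher" (in XML Notation) is accepted because
its ASCII form has the 'domainlabel' shape, whereas a label starting
with a hyphen is rejected by the shape check.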
Note that the space character and various delimiters are allowed in Note that the space character and various delimiters are allowed in
IRIs and IRI references. This is further discussed in Section 5.1. IRIs and IRI references. This is further discussed in Section 5.1.
The following are the same as [RFC2396bis]: The following are the same as [RFC2396bis]:
scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." ) scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )
port = *DIGIT port = *DIGIT
domainlabel = alphanum [ 0*61( alphanum / "-" ) alphanum ]
toplabel = ALPHA [ 0*61( alphanum / "-" ) alphanum ]
alphanum = ALPHA / DIGIT alphanum = ALPHA / DIGIT
IPv4address = dec-octet 3( "." dec-octet ) IPv4address = dec-octet 3( "." dec-octet )
dec-octet = DIGIT / ; 0-9 dec-octet = DIGIT / ; 0-9
( %x31-39 DIGIT ) / ; 10-99 ( %x31-39 DIGIT ) / ; 10-99
( "1" 2*DIGIT ) / ; 100-199 ( "1" 2*DIGIT ) / ; 100-199
( "2" %x30-34 DIGIT ) / ; 200-249 ( "2" %x30-34 DIGIT ) / ; 200-249
( "25" %x30-35 ) ; 250-255 ( "25" %x30-35 ) ; 250-255
IPv6reference = "[" IPv6address "]" IPv6reference = "[" IPv6address "]"
IPv6address = ( 7( h4 ":" ) h4 ) / IPv6address = ( 7( h4 ":" ) h4 ) /
( "::" 0*6( h4 ":" ) [ h4 ] ) / ( "::" 0*6( h4 ":" ) [ h4 ] ) /
( h4 "::" 0*5( h4 ":" ) [ h4 ] ) / ( h4 "::" 0*5( h4 ":" ) [ h4 ] ) /
( h4 ":" h4 "::" 0*4( h4 ":" ) [ h4 ] ) / ( h4 ":" h4 "::" 0*4( h4 ":" ) [ h4 ] ) /
( h4 2( ":" h4 ) "::" 0*3( h4 ":" ) [ h4 ] ) / ( h4 2( ":" h4 ) "::" 0*3( h4 ":" ) [ h4 ] ) /
( h4 3( ":" h4 ) "::" 0*2( h4 ":" ) [ h4 ] ) / ( h4 3( ":" h4 ) "::" 0*2( h4 ":" ) [ h4 ] ) /
( h4 4( ":" h4 ) "::" 0*1( h4 ":" ) [ h4 ] ) / ( h4 4( ":" h4 ) "::" 0*1( h4 ":" ) [ h4 ] ) /
( 6( h4 ":" ) IPv4address )/ ( 6( h4 ":" ) IPv4address )/
( "::" 0*5( h4 ":" ) IPv4address )/ ( "::" 0*5( h4 ":" ) IPv4address )/
skipping to change at page 10, line 35 skipping to change at page 10, line 21
( h4 "::" 0*4( h4 ":" ) IPv4address )/ ( h4 "::" 0*4( h4 ":" ) IPv4address )/
( h4 ":" h4 "::" 0*3( h4 ":" ) IPv4address )/ ( h4 ":" h4 "::" 0*3( h4 ":" ) IPv4address )/
( h4 2( ":" h4 ) "::" 0*2( h4 ":" ) IPv4address )/ ( h4 2( ":" h4 ) "::" 0*2( h4 ":" ) IPv4address )/
( h4 3( ":" h4 ) "::" 0*1( h4 ":" ) IPv4address ) ( h4 3( ":" h4 ) "::" 0*1( h4 ":" ) IPv4address )
h4 = 1*4HEXDIG h4 = 1*4HEXDIG
reserved = "[" / "]" / ";" / "/" / "?" / reserved = "[" / "]" / ";" / "/" / "?" /
":" / "@" / "&" / "=" / "+" / "$" / "," / ":" / "@" / "&" / "=" / "+" / "$" / ","
unreserved = ALPHA / DIGIT / mark unreserved = ALPHA / DIGIT / mark
mark = "-" / "_" / "." / "!" / "~" / "*" / "'" / mark = "-" / "_" / "." / "!" / "~" / "*" / "'" /
"(" / ")" "(" / ")"
escaped = "%" HEXDIG HEXDIG escaped = "%" HEXDIG HEXDIG
2.3 IRI Equivalence and Normalization 2.3 IRI Equivalence and Normalization
There is no general rule or procedure to decide whether two arbitrary There is no general rule or procedure to decide whether two arbitrary
IRIs are equivalent or not (i.e. refer to the same resource or not). IRIs are equivalent or not (i.e. refer to the same resource or not).
Two IRIs that look almost the same may refer to different resources. Two IRIs that look almost the same may refer to different resources.
Two IRIs that look completely different may refer to, and resolve to, Two IRIs that look completely different may refer to, and resolve to,
the same resource. the same resource.
In some scenarios, such as XML Namespaces ([XMLNamespace]), a In some scenarios a definite answer to the question of IRI
definite answer to the question of IRI equivalence is needed that is equivalence is needed that is independent of the scheme used and
independent of the scheme used and always can be calculated quickly always can be calculated quickly and without accessing a network. An
and without accessing a network. In such cases, two IRIs SHOULD be example of such a case might be XML Namespaces ([XMLNamespace]). In
defined as equivalent if and only if they are character-by-character such cases, two IRIs SHOULD be defined as equivalent if and only if
equivalent. This is the same as being byte-by-byte equivalent if the they are character-by-character equivalent. This is the same as
character encoding for both IRIs is the same. As an example, being byte-by-byte equivalent if the character encoding for both IRIs
is the same. As an example,
http://example.org/~user, http://example.org/%7euser, and http://example.org/~user, http://example.org/%7euser, and
http://example.org/%7Euser would not be equivalent. In such a case, http://example.org/%7Euser would not be equivalent under this
the comparison function MUST NOT map the IRIs to URIs. definition. In such a case, the comparison function MUST NOT map the
IRIs to URIs, because such a mapping would create something different
under this equivalence relationship.
It follows from the above that IRIs SHOULD NOT be modified when being It follows from the above that IRIs SHOULD NOT be modified when being
transported. transported.
For actual resolution, differences in escaping (except for the For actual resolution, differences in escaping (except for the
escaping of reserved characters) MUST always result in the same escaping of reserved characters) MUST always result in the same
resource. For example, http://example.org/~user, resource. For example, http://example.org/~user,
http://example.org/%7euser and http://example.org/%7Euser must http://example.org/%7euser and http://example.org/%7Euser must
resolve to the same resource. If this kind of equivalence is to be resolve to the same resource. If this kind of equivalence is to be
tested, the escaping of both IRIs to be compared has to be aligned, tested, the escaping of both IRIs to be compared has to be aligned,
skipping to change at page 11, line 34 skipping to change at page 11, line 24
escape is always the same. Such conversions MUST only be done on the escape is always the same. Such conversions MUST only be done on the
fly, without changing the original IRI. fly, without changing the original IRI.
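The two kinds of comparison discussed in this section can be
contrasted in a short sketch. It assumes one possible way of aligning
the escaping: escapes of 'unreserved' US-ASCII characters are
unescaped, and the hexadecimal digits of all remaining escapes are put
in upper case. The function names are hypothetical. Non-ASCII
characters are not handled here; comparing them with their escaped
form would additionally require mapping both IRIs to URIs.

```python
import re
import string

# 'unreserved' = ALPHA / DIGIT / mark
_UNRESERVED = set(string.ascii_letters + string.digits + "-_.!~*'()")
_ESCAPE = re.compile(r"%([0-9A-Fa-f]{2})")


def align_escapes(iri: str) -> str:
    """Return a copy with aligned escaping; the original is not modified."""

    def replace(match: re.Match) -> str:
        char = chr(int(match.group(1), 16))
        if char in _UNRESERVED:
            return char  # escaping of unreserved characters is dropped
        return "%" + match.group(1).upper()  # reserved etc. stay escaped

    return _ESCAPE.sub(replace, iri)


def same_identifier(a: str, b: str) -> bool:
    """Character-by-character equivalence, with no mapping applied."""
    return a == b


def same_for_resolution(a: str, b: str) -> bool:
    """Equivalence after aligning the escaping on the fly."""
    return align_escapes(a) == align_escapes(b)


plain = "http://example.org/~user"
lower = "http://example.org/%7euser"
upper = "http://example.org/%7Euser"
```

With these definitions the three example IRIs are pairwise different
as identifiers, but all three compare as equal for resolution.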
Specific schemes and resolution mechanisms may define additional Specific schemes and resolution mechanisms may define additional
equivalences. For a specific scheme, two IRIs that e.g. differ only equivalences. For a specific scheme, two IRIs that e.g. differ only
by case may be equivalent. However, this document does not deal with by case may be equivalent. However, this document does not deal with
scheme-specific issues. scheme-specific issues.
The Unicode Standard [UNIV3] defines various equivalences between The Unicode Standard [UNIV3] defines various equivalences between
sequences of characters for various purposes. Unicode Standard Annex sequences of characters for various purposes. Unicode Standard Annex
#15 [UNI15] defines various Normalization Forms for these #15 [UTR15] defines various Normalization Forms for these
equivalences. IRIs SHOULD be created using Normalization Form C equivalences. IRIs SHOULD be created using Normalization Form C
(NFC). Equivalence of IRIs MUST rely on the IRIs being appropriately (NFC). Equivalence of IRIs MUST rely on the assumption that IRIs are
pre-normalized, rather than applying normalization, except when appropriately pre-normalized, rather than applying normalization when
converting from a non-UCS-based encoding to a UCS-based encoding, comparing two IRIs, except when converting from a non-UCS-based
where a normalizing transcoder using NFC MUST be used. encoding to a UCS-based encoding, where a normalizing transcoder
using NFC MUST be used for interoperability.
As an example, http://www.example.org/r&#xe9;sum&#xe9;.html (in XML As an example, http://www.example.org/r&#xe9;sum&#xe9;.html (in XML
Notation) is in NFC. On the other hand, http://www.example.org/ Notation) is in NFC. On the other hand, http://www.example.org/
re&#x301;sume&#x301;.html is not in NFC. The former uses precombined re&#x301;sume&#x301;.html is not in NFC. The former uses precombined
e-acute characters; the latter uses 'e' characters followed by e-acute characters; the latter uses 'e' characters followed by
combining acute accents. Both are defined as canonically equivalent combining acute accents. Both are defined as canonically equivalent
in [UNIV3]. in [UNIV3].
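The example above can be verified with the normalization support of
the Python standard library. This sketch is only an illustration; the
helper name is_nfc is hypothetical.

```python
import unicodedata


def is_nfc(text: str) -> bool:
    """True if the text is unchanged by normalization to NFC."""
    return unicodedata.normalize("NFC", text) == text


precombined = "http://www.example.org/r\u00e9sum\u00e9.html"
decomposed = "http://www.example.org/re\u0301sume\u0301.html"

# Normalizing the decomposed form yields the precombined form.
normalized = unicodedata.normalize("NFC", decomposed)
```

Note that the two strings differ character by character, so they are
not equivalent as IRIs unless one of them is normalized at creation
time.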
Various IRI schemes may allow the usage of International Domain Names Various IRI schemes may allow the usage of International Domain Names
(IDN) [RFCXXXX]. When in use in IRIs, those names SHOULD be (IDN) [RFCXXXX]. When in use in IRIs, those names SHOULD be
skipping to change at page 12, line 46 skipping to change at page 12, line 37
b) Interpretational: URIs identify resources in various ways. b) Interpretational: URIs identify resources in various ways.
IRIs also identify resources. When the IRI is used simply for IRIs also identify resources. When the IRI is used simply for
identification purposes, it is not necessary to map the IRI to identification purposes, it is not necessary to map the IRI to
a URI (see Section 2.3). However, when an IRI is used for a URI (see Section 2.3). However, when an IRI is used for
resource retrieval, the resource that the IRI locates is the resource retrieval, the resource that the IRI locates is the
same as the one located by the URI obtained after converting same as the one located by the URI obtained after converting
the IRI according to the procedure defined here. This means the IRI according to the procedure defined here. This means
that there is no need to define resolution separately on the that there is no need to define resolution separately on the
IRI level. IRI level.
This mapping is accomplished in two steps. Applications MUST map IRIs to URIs using the following two steps.
Step 1) This step generates a UCS-based encoding from the original Step 1) This step generates a UCS-based encoding from the original
IRI format. This step has three variants, depending on the IRI format. This step has three variants, depending on the
form of the input. form of the input.
Variant A) If the IRI is written on paper or read out loud, Variant A) If the IRI is written on paper or read out loud,
or otherwise represented as a sequence of characters or otherwise represented as a sequence of characters
independent of any encoding: Represent the IRI as a independent of any encoding: Represent the IRI as a
sequence of characters from the UCS normalized according sequence of characters from the UCS normalized according
to Normalization Form C (NFC, [UNI15]). to Normalization Form C (NFC, [UTR15]).
Variant B) If the IRI is in some digital representation Variant B) If the IRI is in some digital representation
(e.g. an octet stream) in some non-Unicode encoding: (e.g. an octet stream) in some non-Unicode encoding:
Convert the IRI to a sequence of characters from the UCS Convert the IRI to a sequence of characters from the UCS
normalized according to NFC. normalized according to NFC.
Variant C) If the IRI is in a Unicode-based encoding (for Variant C) If the IRI is in a Unicode-based encoding (for
example UTF-8 or UTF-16): Do not normalize. Move example UTF-8 or UTF-16): Do not normalize. Move
directly to Step 2. directly to Step 2.
Step 2) For each character that is disallowed in URI references, Step 2) For each character that is disallowed in URI references,
apply steps 1) through 3) below. The disallowed characters apply steps 1) through 3) below. The disallowed characters
consist of all non-ASCII characters, plus the excluded consist of all non-ASCII characters, plus the excluded
characters listed in Section 2.4 of [RFC2396], except for the characters listed in Section 2.4 of [RFC2396], except for the
number sign (#) and percent sign (%) and the square bracket number sign (#) and percent sign (%) and the square bracket
characters re-allowed in [RFC2732]. characters re-allowed in [RFC2732].
1) Convert the character to a sequence of one or more octets 1) Convert the character to a sequence of one or more octets
using UTF-8 [RFC2279]. using UTF-8 [RFC2279].
2) Convert each octet to %hh, where hh is the hexadecimal 2) Convert each octet to %HH, where HH is the hexadecimal
notation of the octet value. Note: This is identical to notation of the octet value. Note: This is identical to
the escaping mechanism in Section 2.4.1 of [RFC2396]. the escaping mechanism in Section 2.4.1 of [RFC2396].
Note: To reduce variability, the hexadecimal notation Note: To reduce variability, the hexadecimal notation
SHOULD use lower case letters. SHOULD use upper case letters.
3) Replace the original character by the resulting character 3) Replace the original character by the resulting character
sequence. sequence (i.e. a sequence of %HH triplets).
Note that in this process (in step 2.3), characters allowed in URI Note that in this process (in step 2.3), characters allowed in URI
references and existing escape sequences are not escaped further. references and existing escape sequences are not escaped further.
(This mapping is similar to, but different from, the escaping applied (This mapping is similar to, but different from, the escaping applied
when including arbitrary content into some part of a URI.) For when including arbitrary content into some part of a URI.) For
example, an IRI of example, an IRI of
http://www.example.org/red%09ros&#xe9;#<red> (in XML notation) is http://www.example.org/red%09ros&#xe9;#<red> (in XML notation) is
converted to converted to
http://www.example.org/red%09ros%c3%a9#%3cred%3e, not to something http://www.example.org/red%09ros%C3%A9#%3Cred%3E, not to something
like like
http%3a%2f%2fwww.example.org%2fred%2509ros%c3%a9%23red. http%3A%2F%2Fwww.example.org%2Fred%2509ros%C3%A9%23red.
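The two mapping steps above can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the function name iri_to_uri and the normalize flag are hypothetical, and the set of excluded US-ASCII characters is our reading of Section 2.4 of [RFC2396] (control characters, space, delims, unwise) minus '#', '%', '[' and ']'.

```python
import unicodedata

# Assumption: excluded US-ASCII characters from RFC 2396 Section 2.4
# (control, space, delims, unwise), without '#', '%', '[' and ']'.
_EXCLUDED_ASCII = set('<>"{}|\\^` ') | {chr(c) for c in range(0x20)} | {"\x7f"}


def iri_to_uri(iri: str, normalize: bool = False) -> str:
    """Map an IRI (already a sequence of UCS characters) to a URI.

    Pass normalize=True for Variants A and B of Step 1 (NFC);
    leave it False for Variant C (input already Unicode-based).
    """
    if normalize:
        iri = unicodedata.normalize("NFC", iri)
    out = []
    for ch in iri:
        if ord(ch) > 0x7F or ch in _EXCLUDED_ASCII:
            # Step 2.1 and 2.2: UTF-8 octets, each written as %HH (upper case).
            out.append("".join(f"%{octet:02X}" for octet in ch.encode("utf-8")))
        else:
            # Allowed characters and existing escapes are left untouched.
            out.append(ch)
    return "".join(out)


print(iri_to_uri("http://www.example.org/red%09ros\u00e9#<red>"))
```

Because '%' is never escaped, existing escape sequences such as %09 pass through unchanged, which is the difference from escaping arbitrary content into a URI component.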
Note that some older software transcoding to UTF-8 may produce Note that some older software transcoding to UTF-8 may produce
illegal output for some input, in particular for characters outside illegal output for some input, in particular for characters outside
the BMP (Basic Multilingual Plane). As an example, for the following the BMP (Basic Multilingual Plane). As an example, for the following
IRI with non-BMP characters (in XML Notation): IRI with non-BMP characters (in XML Notation):
http://example.com/&#x10300;&#x10301;&#x10302; http://example.com/&#x10300;&#x10301;&#x10302;
(the first three letters of the Old Italic alphabet) the correct (the first three letters of the Old Italic alphabet) the correct
conversion to a URI is: conversion to a URI is:
http://example.com/%F0%90%8C%80%F0%90%8C%81%F0%90%8C%82 http://example.com/%F0%90%8C%80%F0%90%8C%81%F0%90%8C%82
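As a hedged illustration of the non-BMP case, the following Python sketch escapes U+10300 through U+10302 with a conforming UTF-8 encoder and contrasts it with one plausible failure mode, namely encoding each UTF-16 surrogate code unit separately. The helper names are hypothetical, and the surrogate-based variant is an assumption about how older software might go wrong, not something stated above.

```python
def escape_utf8(text: str) -> str:
    """Escape every character of text as %HH triplets of its UTF-8 octets."""
    return "".join(f"%{octet:02X}" for octet in text.encode("utf-8"))


def escape_surrogates_wrongly(text: str) -> str:
    """Illegal variant: encode each UTF-16 code unit as if it were a character."""
    utf16 = text.encode("utf-16-be")
    units = [int.from_bytes(utf16[i:i + 2], "big") for i in range(0, len(utf16), 2)]
    broken = "".join(chr(unit) for unit in units)
    # 'surrogatepass' forces Python to emit the ill-formed three-octet forms.
    return "".join(f"%{octet:02X}" for octet in broken.encode("utf-8", "surrogatepass"))


old_italic = "\U00010300\U00010301\U00010302"
print("http://example.com/" + escape_utf8(old_italic))
print("http://example.com/" + escape_surrogates_wrongly(old_italic))
```

The first line printed matches the correct conversion given above; the second uses six octets per character and is not strictly legal UTF-8.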
skipping to change at page 14, line 46 skipping to change at page 14, line 37
a) Some escape sequences are necessary to distinguish escaped and a) Some escape sequences are necessary to distinguish escaped and
unescaped uses of reserved characters. unescaped uses of reserved characters.
b) Some escape sequences cannot be interpreted as sequences of b) Some escape sequences cannot be interpreted as sequences of
UTF-8 octets. UTF-8 octets.
(Note: Due to the regularities in the octet patterns of UTF-8, (Note: Due to the regularities in the octet patterns of UTF-8,
there is a very high probability, but no guarantee, that escape there is a very high probability, but no guarantee, that escape
sequences that can be interpreted as sequences of UTF-8 octets sequences that can be interpreted as sequences of UTF-8 octets
actually originated from UTF-8. For a detailed discussion, see actually originated from UTF-8. For a detailed discussion, see
[Duer97].) [Duerst97].)
c) The conversion may result in a character that is not c) The conversion may result in a character that is not
appropriate in an IRI. See Section 5.1 for further details. appropriate in an IRI. See Section 5.1 for further details.
Conversion from a URI to an IRI is done using the following steps (or Conversion from a URI to an IRI is done using the following steps (or
any other algorithm that produces the same result): any other algorithm that produces the same result):
1) Represent the URI as a sequence of octets in US-ASCII. 1) Represent the URI as a sequence of octets in US-ASCII.
2) Convert all hexadecimal escapes (% followed by two hexadecimal 2) Convert all hexadecimal escapes (% followed by two hexadecimal
digits) except those corresponding to '#' and '%' and digits) except those corresponding to '#' and '%' and
characters in 'reserved', to the corresponding octets. characters in 'reserved', to the corresponding octets.
3) Re-escape any octets that are not part of a strictly legal UTF- 3) Re-escape any octet produced in step 2) that is not part of a
8 octet sequence. strictly legal UTF-8 octet sequence.
4) Re-escape all octets that in UTF-8 represent characters that 4) Re-escape all octets produced in step 2) that in UTF-8
are not appropriate according to Section 5.1. represent characters that are not appropriate according to
Section 4.1 and Section 5.1.
5) Interpret the resulting octet sequence as a sequence of 5) Interpret the resulting octet sequence as a sequence of
characters encoded in UTF-8. characters encoded in UTF-8.
This procedure will convert as many escaped non-ASCII characters as This procedure will convert as many escaped non-ASCII characters as
possible to characters in an IRI. Because there are some choices possible to characters in an IRI. Because there are some choices
when applying step 4) (see Section 5.1), results may differ. when applying step 4) (see Section 5.1), results may differ.
Conversions from URIs to IRIs MUST NOT use any other encoding than
UTF-8 in steps 3) and 4) above, even if it might be possible from
context to guess that another encoding than UTF-8 was used in the
URI. As an example, the URI http://www.example.org/r%E9sum%E9.html,
which with some guesses might be interpreted to contain two e-acute
characters encoded as iso-8859-1, must not be converted to an IRI
containing these e-acute characters. Otherwise, the IRI will in the
future be mapped to http://www.example.org/r%C3%A9sum%C3%A9.html,
which is a different URI from http://www.example.org/r%E9sum%E9.html.
3.2.1 Examples
This section shows various examples of converting URIs to IRIs. The
notation <hh> is used to denote octets outside those that can be
represented in this document. Each example shows the result after
applying each of the steps 1) to 5). XML Notation is used for the
final result.
The following example contains the sequence '%C3%BC', which is a
strictly legal UTF-8 sequence, and which is converted into the actual
character U+00FC LATIN SMALL LETTER U WITH DIAERESIS (also known as
u-umlaut).
1) http://www.example.org/D%C3%BCrst
2) http://www.example.org/D<c3><bc>rst
3) http://www.example.org/D<c3><bc>rst
4) http://www.example.org/D<c3><bc>rst
5) http://www.example.org/D&#xfc;rst
The following example contains the sequence '%FC', which might
represent U+00FC LATIN SMALL LETTER U WITH DIAERESIS in the iso-8859-
1 encoding. (It might represent other characters in other encodings.
For example, the octet <FC> in iso-8859-5 represents U+045C CYRILLIC
SMALL LETTER KJE.) Because <FC> is not part of a strictly legal UTF-8
sequence, it is re-escaped in step 3).
1) http://www.example.org/D%FCrst
2) http://www.example.org/D<FC>rst
3) http://www.example.org/D%FCrst
4) http://www.example.org/D%FCrst
5) http://www.example.org/D%FCrst
The following example contains '%e2%80%ae', which is the escaped UTF-
8 encoding of U+202E, RIGHT-TO-LEFT OVERRIDE. Section 4.1 forbids
the direct use of this character in an IRI. Therefore, the
corresponding octets are re-escaped in step 4). This example shows
that the case (upper or lower) of letters used in escapes may not be
preserved.
1) http://www.example.org/%e2%80%ae
2) http://www.example.org/<E2><80><AE>
3) http://www.example.org/<E2><80><AE>
4) http://www.example.org/%E2%80%AE
5) http://www.example.org/%E2%80%AE
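The five conversion steps of Section 3.2 and the examples above can be sketched in Python as follows. This is only a sketch under stated assumptions: uri_to_iri and is_appropriate are hypothetical names, and is_appropriate is a simplified stand-in for the rules of Section 4.1 and Section 5.1 (it rejects the bidirectional formatting characters, control characters, and the excluded US-ASCII characters; the real rules may differ). The set of escapes that are never converted is '#', '%', the 'reserved' characters of [RFC2396], and the square brackets of [RFC2732].

```python
import re

# Escapes for these octets are left alone in step 2.
_KEEP_ESCAPED = set(b"#%;/?:@&=+$,[]")
# Assumption: characters Section 4.1 forbids from appearing directly.
_BIDI_FORMATTING = set("\u200e\u200f\u202a\u202b\u202c\u202d\u202e")
_ESCAPE_RUN = re.compile(r"(?:%[0-9A-Fa-f]{2})+")


def is_appropriate(ch: str) -> bool:
    """Hypothetical stand-in for the checks of Sections 4.1 and 5.1."""
    cp = ord(ch)
    if ch in _BIDI_FORMATTING or cp < 0x20 or 0x7F <= cp < 0xA0:
        return False
    return ch not in '<>"{}|\\^` '


def _sequence_length(lead: int) -> int:
    """Length of a UTF-8 sequence starting with this octet, 0 if not a lead."""
    if lead < 0x80:
        return 1
    if 0xC2 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF4:
        return 4
    return 0


def _convert_run(match: "re.Match[str]") -> str:
    text = match.group(0)
    octets = bytes.fromhex(text.replace("%", ""))  # step 2
    out, i = [], 0
    while i < len(octets):
        n = _sequence_length(octets[i])
        chunk = octets[i:i + n]
        try:
            if n == 0 or len(chunk) < n:
                raise ValueError("not a complete UTF-8 sequence")
            ch = chunk.decode("utf-8")  # strict: rejects overlong forms
        except ValueError:
            out.append(f"%{octets[i]:02X}")  # step 3: re-escape the octet
            i += 1
            continue
        if octets[i] in _KEEP_ESCAPED:
            out.append(text[3 * i:3 * i + 3])  # never unescaped; keep verbatim
        elif is_appropriate(ch):
            out.append(ch)  # step 5: the character itself
        else:
            out.append("".join(f"%{b:02X}" for b in chunk))  # step 4
        i += n
    return "".join(out)


def uri_to_iri(uri: str) -> str:
    uri.encode("ascii")  # step 1: raises if the URI is not US-ASCII
    return _ESCAPE_RUN.sub(_convert_run, uri)


for example in ("http://www.example.org/D%C3%BCrst",
                "http://www.example.org/D%FCrst",
                "http://www.example.org/%e2%80%ae"):
    print(uri_to_iri(example))
```

Only UTF-8 is ever tried, so http://www.example.org/r%E9sum%E9.html is returned unchanged rather than being guessed as iso-8859-1.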
4. Bidirectional IRIs for Right-to-left Languages 4. Bidirectional IRIs for Right-to-left Languages
Some UCS characters, such as those used in the Arabic and Hebrew Some UCS characters, such as those used in the Arabic and Hebrew
script, have an inherent right-to-left writing direction. IRIs script, have an inherent right-to-left writing direction. IRIs
containing such characters (called bidirectional IRIs or Bidi IRIs) containing such characters (called bidirectional IRIs or Bidi IRIs)
require additional attention because of the non-trivial relation require additional attention because of the non-trivial relation
between logical representation (used for digital representation as between logical representation (used for digital representation as
well as when reading/spelling) and visual representation (used for well as when reading/spelling) and visual representation (used for
display/printing). display/printing).
skipping to change at page 24, line 6 skipping to change at page 25, line 18
Outside of the US-ASCII range, there are many more opportunities for Outside of the US-ASCII range, there are many more opportunities for
confusion; a complete set of guidelines is too lengthy to include confusion; a complete set of guidelines is too lengthy to include
here. As long as names are limited to characters from a single here. As long as names are limited to characters from a single
script, native writers of a given script or language will know best script, native writers of a given script or language will know best
when ambiguities can appear, and how they can be avoided. What may when ambiguities can appear, and how they can be avoided. What may
look ambiguous to a stranger may be completely obvious to the average look ambiguous to a stranger may be completely obvious to the average
native user. On the other hand, in some cases, the UCS contains native user. On the other hand, in some cases, the UCS contains
variants for compatibility reasons, for example for typographic variants for compatibility reasons, for example for typographic
purposes. These should be avoided wherever possible. Although there purposes. These should be avoided wherever possible. Although there
may be exceptions, in general newly created resource names should be may be exceptions, in general newly created resource names should be
in NFKC [UNI15] (which means that they are also in NFC). in NFKC [UTR15] (which means that they are also in NFC).
As an example, the UCS contains a codepoint for the 'fi' ligature. As an example, the UCS contains codepoint U+FB01 for the 'fi'
Wherever possible, IRIs should use the two letters 'f' and 'i' rather ligature for compatibility reasons. Wherever possible, IRIs should
than the 'fi' ligature. An example where the latter may be used is in use the two letters 'f' and 'i' rather than the 'fi' ligature. An
the query part of an IRI for an explicit search for a word containing example where the latter may be used is in the query part of an IRI
the 'fi' ligature. for an explicit search for a word containing the 'fi' ligature.
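The effect of NFKC on the 'fi' ligature can be checked with Python's standard unicodedata module. The helper is_nfkc below is a hypothetical convenience for testing whether a newly created resource name is already in NFKC; it is a sketch, not part of this specification.

```python
import unicodedata


def is_nfkc(name: str) -> bool:
    """True if name is unchanged by NFKC normalization."""
    return unicodedata.normalize("NFKC", name) == name


ligature_name = "\ufb01le"  # U+FB01 LATIN SMALL LIGATURE FI followed by "le"
plain_name = unicodedata.normalize("NFKC", ligature_name)

print(plain_name)               # the ligature becomes the letters 'f' and 'i'
print(is_nfkc(ligature_name))   # the ligature form is not in NFKC
print(unicodedata.normalize("NFC", ligature_name) == ligature_name)
```

NFC leaves the ligature untouched, which is why a name can be in NFC without being in NFKC, but not the other way around.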
In certain cases, there is a chance that characters from different In certain cases, there is a chance that characters from different
scripts look the same. The best known example is the Latin 'A', the scripts look the same. The best known example is the Latin 'A', the
Greek 'Alpha', and the Cyrillic 'A'. To avoid such cases, only IRIs Greek 'Alpha', and the Cyrillic 'A'. To avoid such cases, only IRIs
should be generated where all the characters in a single component should be generated where all the characters in a single component
are used together in a given language. This usually means that all are used together in a given language. This usually means that all
these characters will be from the same script, but there are these characters will be from the same script, but there are
languages that mix characters from different scripts (such as languages that mix characters from different scripts (such as
Japanese). This is similar to the heuristics used to distinguish Japanese). This is similar to the heuristics used to distinguish
between letters and numbers in the examples above. Also, for Latin, between letters and numbers in the examples above. Also, for Latin,
skipping to change at page 25, line 17 skipping to change at page 26, line 30
how currently some servers treat URIs as case-insensitive, or perform how currently some servers treat URIs as case-insensitive, or perform
additional matching to account for spelling errors. For characters additional matching to account for spelling errors. For characters
beyond the ASCII repertoire, this may for example include ignoring beyond the ASCII repertoire, this may for example include ignoring
the accents on received IRIs or resource names where appropriate. the accents on received IRIs or resource names where appropriate.
Please note that such mappings, including case mappings, are Please note that such mappings, including case mappings, are
language-dependent. language-dependent.
It can be difficult to unambiguously identify a resource if too many It can be difficult to unambiguously identify a resource if too many
mappings are taken into consideration. However, escaped and non- mappings are taken into consideration. However, escaped and non-
escaped parts of IRIs can always clearly be distinguished. Also, the escaped parts of IRIs can always clearly be distinguished. Also, the
regularity of UTF-8 (see [Duer97]) makes the potential for collisions regularity of UTF-8 (see [Duerst97]) makes the potential for
lower than it may seem at first sight. collisions lower than it may seem at first sight.
6.8 Upgrading Strategy 6.8 Upgrading Strategy
Where this recommendation places further constraints on software for Where this recommendation places further constraints on software for
which many instances are already deployed, it is important to which many instances are already deployed, it is important to
introduce upgrades carefully, and to be aware of the various introduce upgrades carefully, and to be aware of the various
interdependencies. interdependencies.
If IRIs cannot be interpreted correctly, they should not be generated If IRIs cannot be interpreted correctly, they should not be generated
or transported. This suggests that upgrading URI interpreting or transported. This suggests that upgrading URI interpreting
skipping to change at page 26, line 32 skipping to change at page 27, line 45
similar, but may contain all kinds of changes that may be difficult similar, but may contain all kinds of changes that may be difficult
to spot but can cause all kinds of problems. Most spoofing to spot but can cause all kinds of problems. Most spoofing
possibilities for IRIs are extensions of those for URIs. possibilities for IRIs are extensions of those for URIs.
Spoofing can occur for various reasons. A first reason is that Spoofing can occur for various reasons. A first reason is that
normalization expectations of a user or actual normalization when normalization expectations of a user or actual normalization when
entering an IRI do not match the normalization used on the server entering an IRI do not match the normalization used on the server
side. Conceptually, this is no different from the problems side. Conceptually, this is no different from the problems
surrounding the use of case-insensitive web servers. For example, a surrounding the use of case-insensitive web servers. For example, a
popular web page with a mixed case name (http://big.site/ popular web page with a mixed case name (http://big.site/
PopularPage.html) might be "spoofed" by someone who obtains access to PopularPage.html) might be "spoofed" by someone who is able to create
http://big.site/popularpage.html. However, the introduction of http://big.site/popularpage.html. However, the introduction of
character normalization, and of additional mappings for user character normalization, and of additional mappings for user
convenience, may increase the chance for spoofing. convenience, may increase the chance for spoofing.
Spoofing can occur because in the UCS, there are many characters that Spoofing can occur because in the UCS, there are many characters that
look very similar. Details are discussed in Section 6.5. Again, look very similar. Details are discussed in Section 6.5. Again,
this is very similar to spoofing possibilities on US-ASCII, e.g. this is very similar to spoofing possibilities on US-ASCII, e.g.
using 'br0ken' or '1ame' URIs. using 'br0ken' or '1ame' URIs.
Spoofing can occur when URIs in various encodings are accepted to Spoofing can occur when URIs in various encodings are accepted to
deal with older user agents. In some cases, in particular for Latin- deal with older user agents. In some cases, in particular for Latin-
based resource names, this is usually easy to detect because UTF-8- based resource names, this is usually easy to detect because UTF-8-
encoded names, when interpreted and viewed as legacy encodings, encoded names, when interpreted and viewed as legacy encodings,
produce mostly garbage. In other cases, when concurrently used produce mostly garbage. In other cases, when concurrently used
encodings have a similar structure, but there are no characters that encodings have a similar structure, but there are no characters that
have exactly the same encoding, detection is more difficult. have exactly the same encoding, detection is more difficult.
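For the Latin-based case, the following Python sketch shows both observations: a UTF-8-encoded name viewed as iso-8859-1 turns into visible garbage, and octets from a legacy encoding usually fail a strict UTF-8 check. The name is_strict_utf8 is hypothetical, and the sample resource name is only an illustration.

```python
def is_strict_utf8(octets: bytes) -> bool:
    """True if octets form a strictly legal UTF-8 sequence."""
    try:
        octets.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


name = "r\u00e9sum\u00e9"
utf8_octets = name.encode("utf-8")
latin1_octets = name.encode("iso-8859-1")

# UTF-8 octets misread as iso-8859-1: each e-acute becomes two characters.
garbled = utf8_octets.decode("iso-8859-1")
print(garbled)
print(is_strict_utf8(utf8_octets), is_strict_utf8(latin1_octets))
```

Such a check is a heuristic only: it does not help when the concurrently used encodings have a similar structure.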
skipping to change at page 27, line 14 skipping to change at page 28, line 27
part, see [Nameprep]. For the path part, administrators of sites part, see [Nameprep]. For the path part, administrators of sites
which allow independent users to create resources in the same subarea which allow independent users to create resources in the same subarea
may need to be careful to check for spoofing. may need to be careful to check for spoofing.
Spoofing can occur with bidirectional IRIs, if the restrictions in Spoofing can occur with bidirectional IRIs, if the restrictions in
Section 4.2 are not followed. The same visual representation may be Section 4.2 are not followed. The same visual representation may be
interpreted as different logical representations, and vice versa. It interpreted as different logical representations, and vice versa. It
is also very important that a correct Unicode bidirectional is also very important that a correct Unicode bidirectional
implementation is used. implementation is used.
8. Change log 8. Issues List
8.1 Changes from -01 to -02 - Should characters in iadditional be allowed? Under what
conditions?
- Align the description in Section 2.3 with the results of W3C
TAG discussions on issue URIEquivalence.
- Adapt depending on how [IDNURI] is integrated into
[RFC2396bis].
9. Change log
9.1 Changes from -02 to -03
- Added an issues list.
- Added a paragraph prohibiting conversions from URIs to IRIs not
based on UTF-8 to Section 3.2.
- Introduced iadditional to combine unwise, delims, and space.
- Tweaked description and added examples for URI-to-IRI
conversion.
- Improved syntax rules for hostname part.
- Improved description of equivalences in Section 2.3.
- Improved description of URI-to-IRI-mapping in Section 3.2.
- Changed preferred case when hex-escaping from lower to UPPER.
- Fixed various details.
9.2 Changes from -01 to -02
- New approach for Bidi section, many examples. - New approach for Bidi section, many examples.
- Created idelims, removed '%' and '#'. Changed userinfo to - Created idelims, removed '%' and '#'. Changed userinfo to
iuserinfo in iserver. iuserinfo in iserver.
- Changed to ABNF defined by [RFC2234]. - Changed to ABNF defined by [RFC2234].
- Included bug fixes from [RFC2396bis]. - Included bug fixes from [RFC2396bis].
- Additions to Acknowledgements. - Additions to Acknowledgements.
8.2 Changes from -00 to -01 9.3 Changes from -00 to -01
- Re-integrated the section on Bidi, some issues left. - Re-integrated the section on Bidi, some issues left.
- Integrated IDN, changed syntax (host, userinfo,....). - Integrated IDN, changed syntax (host, userinfo,....).
- Moved some text around, marked some as informational. - Moved some text around, marked some as informational.
- Made a clear distinction of IRI use for identification only and - Made a clear distinction of IRI use for identification only and
for resource resolution. for resource resolution.
- Fixed various details in wording, spelling,... - Fixed various details in wording, spelling,...
9. Acknowledgements 10. Acknowledgements
We would like to thank Larry Masinter for his work as coauthor of We would like to thank Larry Masinter for his work as coauthor of
many earlier versions of this document (draft-masinter-url-i18n-xx). many earlier versions of this document (draft-masinter-url-i18n-xx).
The discussion on the issue addressed here has started a long time The discussion on the issue addressed here has started a long time
ago. There was a thread in the HTML working group in August 1995 ago. There was a thread in the HTML working group in August 1995
(under the topic of "Globalizing URIs") and in the www-international (under the topic of "Globalizing URIs") and in the www-international
mailing list in July 1996 (under the topic of "Internationalization mailing list in July 1996 (under the topic of "Internationalization
and URLs"), and ad-hoc meetings at the Unicode conferences in and URLs"), and ad-hoc meetings at the Unicode conferences in
September 1995 and September 1997. September 1995 and September 1997.
Thanks to Francois Yergeau, Matti Allouche, Roy Fielding, Tim Thanks to Francois Yergeau, Matti Allouche, Roy Fielding, Tim
Berners-Lee, Mark Davis, M.T. Carrasco Benitez, James Clark, Tim Berners-Lee, Mark Davis, M.T. Carrasco Benitez, James Clark, Tim
Bray, Chris Wendt, Yaron Goland, Andrea Vine, Misha Wolf, Leslie Bray, Chris Wendt, Yaron Goland, Andrea Vine, Misha Wolf, Leslie
Daigle, Ted Hardie, Makoto MURATA, Steven Atkin, Ryan Stansifer, Tex Daigle, Ted Hardie, Makoto MURATA, Steven Atkin, Ryan Stansifer, Tex
Texin, Graham Klyne, Bjoern Hoehrmann, Chris Lilly, Dan Oscarson, Texin, Graham Klyne, Bjoern Hoehrmann, Chris Lilley, Dan Oscarson,
Elliotte Rusty Harold, Mike J. Brown, Carlos Viegas Damasio, and Elliotte Rusty Harold, Mike J. Brown, Carlos Viegas Damasio, and
many others for help with understanding the issues and possible many others for help with understanding the issues and possible
solutions, and getting the details right. Thanks also to the members solutions, and getting the details right. Thanks also to the members
of the W3C I18N Working Group and Interest Group for their of the W3C I18N Working Group and Interest Group for their
contributions and their work on [CharMod], to the members of many contributions and their work on [CharMod], to the members of many
other W3C WGs for adopting the ideas, and to the members of the other W3C WGs for adopting the ideas, and to the members of the
Montreal IAB Workshop on Internationalization and Localization for Montreal IAB Workshop on Internationalization and Localization for
their review. their review.
Normative References Normative References
skipping to change at page 28, line 50 skipping to change at page 30, line 49
[RFC2732] Hinden, R., Carpenter, B. and L. Masinter, "Format for [RFC2732] Hinden, R., Carpenter, B. and L. Masinter, "Format for
Literal IPv6 Addresses in URL's", RFC 2732, December Literal IPv6 Addresses in URL's", RFC 2732, December
1999. 1999.
[RFCXXXX] Faltstrom, P., Hoffman, P. and A. Costello, [RFCXXXX] Faltstrom, P., Hoffman, P. and A. Costello,
"Internationalizing Domain Names in Applications (IDNA)", "Internationalizing Domain Names in Applications (IDNA)",
draft-ietf-idn-idna-14.txt (work in progress), October draft-ietf-idn-idna-14.txt (work in progress), October
2002, <http://www.ietf.org/internet-drafts/draft-ietf- 2002, <http://www.ietf.org/internet-drafts/draft-ietf-
idn-idna-14.txt>. idn-idna-14.txt>.
[UNI15] Davis, M. and M. Duerst, "Unicode Normalization Forms", [UTR15] Davis, M. and M. Duerst, "Unicode Normalization Forms",
Unicode Standard Annex #15, March 2001, <http:// Unicode Standard Annex #15, March 2001, <http://
www.unicode.org/unicode/reports/tr15/tr15-21.html>. www.unicode.org/unicode/reports/tr15/tr15-21.html>.
Non-normative References Non-normative References
[BidiEx] "Examples of bidirectional IRIs", <http://www.w3.org/ [BidiEx] "Examples of bidirectional IRIs", <http://www.w3.org/
International/iri-edit/BidiExamples>. International/iri-edit/BidiExamples>.
[CharMod] Duerst, M., Yergeau, F., Ishida, R., Wolf, M., [CharMod] Duerst, M., Yergeau, F., Ishida, R., Wolf, M.,
Freytag, A. and T. Texin, "Character Model for the Freytag, A. and T. Texin, "Character Model for the
World Wide Web", World Wide Web Consortium Working World Wide Web", World Wide Web Consortium Working
Draft, April 2002, <http://www.w3.org/TR/charmod>. Draft, April 2002, <http://www.w3.org/TR/charmod>.
[Duer97] Duerst, M., "The Properties and Promises of UTF-8", [Duerst97] Duerst, M., "The Properties and Promises of UTF-8",
Proc. 11th International Unicode Conference, San Jose, Proc. 11th International Unicode Conference, San Jose,
September 1997, <http://www.ifi.unizh.ch/mml/ September 1997, <http://www.ifi.unizh.ch/mml/
mduerst/papers/PDF/IUC11-UTF-8.pdf>. mduerst/papers/PDF/IUC11-UTF-8.pdf>.
[Duer01] Duerst, M., "Internationalized Resource Identifiers: [Duerst01] Duerst, M., "Internationalized Resource Identifiers:
From Specification to Testing", Proc. 19th From Specification to Testing", Proc. 19th
International Unicode Conference, San Jose, International Unicode Conference, San Jose,
September 2001, <http://www.w3.org/2001/Talks/0912- September 2001, <http://www.w3.org/2001/Talks/0912-
IUC-IRI/paper.html>. IUC-IRI/paper.html>.
[HTML4] Raggett, D., Le Hors, A. and I. Jacobs, "HTML 4.01 [HTML4] Raggett, D., Le Hors, A. and I. Jacobs, "HTML 4.01
Specification", World Wide Web Consortium Specification", World Wide Web Consortium
Recommendation, December 1999, <http://www.w3.org/TR/ Recommendation, December 1999, <http://www.w3.org/TR/
REC-html40/appendix/notes.html#h-B.2>. REC-html40/appendix/notes.html#h-B.2>.
[IDNURI] Duerst, M., "Internationalized Domain Names in URIs", [IDNURI] Duerst, M., "Internationalized Domain Names in URIs",
draft-ietf-idn-uri-03.txt (work in progress), July draft-ietf-idn-uri-03.txt (work in progress),
2002, <http://www.ietf.org/internet-drafts/draft- November 2002, <http://www.ietf.org/internet-drafts/
ietf-idn-uri-03.txt>. draft-ietf-idn-uri-03.txt>.
[Nameprep] Hoffman, P. and M. Blanchet, "Nameprep: A Stringprep [Nameprep] Hoffman, P. and M. Blanchet, "Nameprep: A Stringprep
Profile for Internationalized Domain Names", draft- Profile for Internationalized Domain Names", draft-
ietf-idn-nameprep-11.txt (work in progress), June ietf-idn-nameprep-11.txt (work in progress), June
2002, <http://www.ietf.org/internet-drafts/draft- 2002, <http://www.ietf.org/internet-drafts/draft-
ietf-idn-nameprep-11.txt>. ietf-idn-nameprep-11.txt>.
[RFC2119] Bradner, S., "Key words for use in RFCs to Indicate [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate
Requirement Levels", BCP 14, RFC 2119, March 1997. Requirement Levels", BCP 14, RFC 2119, March 1997.
skipping to change at page 31, line 23 skipping to change at page 33, line 23
XML", World Wide Web Consortium Recommendation, XML", World Wide Web Consortium Recommendation,
January 1999, <http://www.w3.org/TR/REC-xml#sec- January 1999, <http://www.w3.org/TR/REC-xml#sec-
external-ent>. external-ent>.
[XMLSchema] Biron, P. and A. Malhotra, "XML Schema Part 2: [XMLSchema] Biron, P. and A. Malhotra, "XML Schema Part 2:
Datatypes", World Wide Web Consortium Recommendation, Datatypes", World Wide Web Consortium Recommendation,
May 2001, <http://www.w3.org/TR/xmlschema-2/#anyURI>. May 2001, <http://www.w3.org/TR/xmlschema-2/#anyURI>.
Authors' Addresses Authors' Addresses
Martin Duerst (Note: Please write "Duerst" with u-umlaut wherever possible, for example as "D&#252;rst" in XML and HTML.) Martin Duerst (Note: Please write "Duerst" with u-umlaut wherever
possible, for example as "D&#252;rst" in XML and HTML.)
World Wide Web Consortium World Wide Web Consortium
200 Technology Square 200 Technology Square
Cambridge, MA 02139 Cambridge, MA 02139
U.S.A. U.S.A.
Phone: +1 617 253 5509 Phone: +1 617 253 5509
Fax: +1 617 258 5999 Fax: +1 617 258 5999
EMail: duerst@w3.org EMail: duerst@w3.org
URI: http://www.w3.org/People/D%C3%BCrst/ URI: http://www.w3.org/People/D%C3%BCrst/
(Note: This is the escaped form of an IRI.) (Note: This is the escaped form of an IRI.)
skipping to change at page 32, line 7 skipping to change at page 34, line 7
One Microsoft Way One Microsoft Way
Redmond, WA 98052 Redmond, WA 98052
U.S.A. U.S.A.
Phone: +1 425 882-8080 Phone: +1 425 882-8080
EMail: mailto:michelsu@microsoft.com EMail: mailto:michelsu@microsoft.com
URI: http://www.suignard.com URI: http://www.suignard.com
Full Copyright Statement Full Copyright Statement
Copyright (C) The Internet Society (2002). All Rights Reserved. Copyright (C) The Internet Society (2003). All Rights Reserved.
This document and translations of it may be copied and furnished to This document and translations of it may be copied and furnished to
others, and derivative works that comment on or otherwise explain it others, and derivative works that comment on or otherwise explain it
or assist in its implementation may be prepared, copied, published or assist in its implementation may be prepared, copied, published
and distributed, in whole or in part, without restriction of any and distributed, in whole or in part, without restriction of any
kind, provided that the above copyright notice and this paragraph are kind, provided that the above copyright notice and this paragraph are
included on all such copies and derivative works. However, this included on all such copies and derivative works. However, this
document itself may not be modified in any way, such as by removing document itself may not be modified in any way, such as by removing
the copyright notice or references to the Internet Society or other the copyright notice or references to the Internet Society or other
Internet organizations, except as needed for the purpose of Internet organizations, except as needed for the purpose of
 End of changes. 

This html diff was produced by rfcdiff 1.12, available from http://www.levkowetz.com/ietf/tools/rfcdiff/