draft-duerst-iri-09.txt   draft-duerst-iri-10.txt 
Network Working Group M. Duerst Network Working Group M. Duerst
Internet-Draft W3C Internet-Draft W3C
Expires: January 17, 2005 M. Suignard Expires: March 28, 2005 M. Suignard
Microsoft Corporation Microsoft Corporation
July 19, 2004 September 27, 2004
Internationalized Resource Identifiers (IRIs) Internationalized Resource Identifiers (IRIs)
draft-duerst-iri-09 draft-duerst-iri-10
Status of this Memo Status of this Memo
By submitting this Internet-Draft, I certify that any applicable This document is an Internet-Draft and is subject to all provisions
patent or other IPR claims of which I am aware have been disclosed, of section 3 of RFC 3667. By submitting this Internet-Draft, each
and any of which I become aware will be disclosed, in accordance with author represents that any applicable patent or other IPR claims of
which he or she is aware have been or will be disclosed, and any of
which he or she become aware will be disclosed, in accordance with
RFC 3668. RFC 3668.
Internet-Drafts are working documents of the Internet Engineering Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF), its areas, and its working groups. Note that Task Force (IETF), its areas, and its working groups. Note that
other groups may also distribute working documents as other groups may also distribute working documents as
Internet-Drafts. Internet-Drafts.
Internet-Drafts are draft documents valid for a maximum of six months Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress." material or to cite them other than as "work in progress."
The list of current Internet-Drafts can be accessed at The list of current Internet-Drafts can be accessed at
http://www.ietf.org/ietf/1id-abstracts.txt. http://www.ietf.org/ietf/1id-abstracts.txt.
The list of Internet-Draft Shadow Directories can be accessed at The list of Internet-Draft Shadow Directories can be accessed at
http://www.ietf.org/shadow.html. http://www.ietf.org/shadow.html.
This Internet-Draft will expire on January 17, 2005. This Internet-Draft will expire on March 28, 2005.
Copyright Notice Copyright Notice
Copyright (C) The Internet Society (2004). All Rights Reserved. Copyright (C) The Internet Society (2004).
Abstract Abstract
This document defines a new protocol element, the Internationalized This document defines a new protocol element, the Internationalized
Resource Identifier (IRI), as a complement to the Uniform Resource Resource Identifier (IRI), as a complement to the Uniform Resource
Identifier (URI). An IRI is a sequence of characters from the Identifier (URI). An IRI is a sequence of characters from the
Universal Character Set (Unicode/ISO 10646). A mapping from IRIs to Universal Character Set (Unicode/ISO 10646). A mapping from IRIs to
URIs is defined, which means that IRIs can be used instead of URIs URIs is defined, which means that IRIs can be used instead of URIs
where appropriate to identify resources. where appropriate to identify resources.
skipping to change at page 2, line 39 skipping to change at page 2, line 41
5. IRI Equivalence and Comparison . . . . . . . . . . . . . . . . 22 5. IRI Equivalence and Comparison . . . . . . . . . . . . . . . . 22
5.1 Simple String Comparison . . . . . . . . . . . . . . . . . 22 5.1 Simple String Comparison . . . . . . . . . . . . . . . . . 22
5.2 Conversion to URIs . . . . . . . . . . . . . . . . . . . . 23 5.2 Conversion to URIs . . . . . . . . . . . . . . . . . . . . 23
5.3 Normalization . . . . . . . . . . . . . . . . . . . . . . 23 5.3 Normalization . . . . . . . . . . . . . . . . . . . . . . 23
5.4 Preferred Forms . . . . . . . . . . . . . . . . . . . . . 24 5.4 Preferred Forms . . . . . . . . . . . . . . . . . . . . . 24
6. Use of IRIs . . . . . . . . . . . . . . . . . . . . . . . . . 25 6. Use of IRIs . . . . . . . . . . . . . . . . . . . . . . . . . 25
6.1 Limitations on UCS Characters Allowed in IRIs . . . . . . 25 6.1 Limitations on UCS Characters Allowed in IRIs . . . . . . 25
6.2 Software Interfaces and Protocols . . . . . . . . . . . . 25 6.2 Software Interfaces and Protocols . . . . . . . . . . . . 25
6.3 Format of URIs and IRIs in Documents and Protocols . . . . 26 6.3 Format of URIs and IRIs in Documents and Protocols . . . . 26
6.4 Use of UTF-8 for Encoding Original Characters . . . . . . 26 6.4 Use of UTF-8 for Encoding Original Characters . . . . . . 26
6.5 Relative IRI References . . . . . . . . . . . . . . . . . 27 6.5 Relative IRI References . . . . . . . . . . . . . . . . . 28
7. URI/IRI Processing Guidelines (informative) . . . . . . . . . 27 7. URI/IRI Processing Guidelines (informative) . . . . . . . . . 28
7.1 URI/IRI Software Interfaces . . . . . . . . . . . . . . . 28 7.1 URI/IRI Software Interfaces . . . . . . . . . . . . . . . 28
7.2 URI/IRI Entry . . . . . . . . . . . . . . . . . . . . . . 28 7.2 URI/IRI Entry . . . . . . . . . . . . . . . . . . . . . . 28
7.3 URI/IRI Transfer Between Applications . . . . . . . . . . 29 7.3 URI/IRI Transfer Between Applications . . . . . . . . . . 29
7.4 URI/IRI Generation . . . . . . . . . . . . . . . . . . . . 29 7.4 URI/IRI Generation . . . . . . . . . . . . . . . . . . . . 30
7.5 URI/IRI Selection . . . . . . . . . . . . . . . . . . . . 30 7.5 URI/IRI Selection . . . . . . . . . . . . . . . . . . . . 30
7.6 Display of URIs/IRIs . . . . . . . . . . . . . . . . . . . 31 7.6 Display of URIs/IRIs . . . . . . . . . . . . . . . . . . . 31
7.7 Interpretation of URIs and IRIs . . . . . . . . . . . . . 31 7.7 Interpretation of URIs and IRIs . . . . . . . . . . . . . 31
7.8 Upgrading Strategy . . . . . . . . . . . . . . . . . . . . 32 7.8 Upgrading Strategy . . . . . . . . . . . . . . . . . . . . 32
8. Security Considerations . . . . . . . . . . . . . . . . . . . 33 8. Security Considerations . . . . . . . . . . . . . . . . . . . 33
9. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 34 9. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 34
10. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . 34 10. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . 34
11. References . . . . . . . . . . . . . . . . . . . . . . . . . 35 11. References . . . . . . . . . . . . . . . . . . . . . . . . . 35
11.1 Normative References . . . . . . . . . . . . . . . . . . . . 35 11.1 Normative References . . . . . . . . . . . . . . . . . . . . 35
11.2 Non-normative References . . . . . . . . . . . . . . . . . . 36 11.2 Non-normative References . . . . . . . . . . . . . . . . . . 36
Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . 38 Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . 38
A. Design Alternatives . . . . . . . . . . . . . . . . . . . . . 39 A. Design Alternatives . . . . . . . . . . . . . . . . . . . . . 39
A.1 New Scheme(s) . . . . . . . . . . . . . . . . . . . . . . 39 A.1 New Scheme(s) . . . . . . . . . . . . . . . . . . . . . . 39
A.2 Other Character Encodings than UTF-8 . . . . . . . . . . . 39 A.2 Other Character Encodings than UTF-8 . . . . . . . . . . . 40
A.3 New Encoding Convention . . . . . . . . . . . . . . . . . 39 A.3 New Encoding Convention . . . . . . . . . . . . . . . . . 40
A.4 Indicating Character Encodings in the URI/IRI . . . . . . 40 A.4 Indicating Character Encodings in the URI/IRI . . . . . . 40
Intellectual Property and Copyright Statements . . . . . . . . 41 Intellectual Property and Copyright Statements . . . . . . . . 41
1. Introduction 1. Introduction
1.1 Overview and Motivation 1.1 Overview and Motivation
A Uniform Resource Identifier (URI) is defined in [RFCYYYY] as a A Uniform Resource Identifier (URI) is defined in [RFCYYYY] as a
sequence of characters chosen from a limited subset of the repertoire sequence of characters chosen from a limited subset of the repertoire
of US-ASCII [ASCII] characters. of US-ASCII [ASCII] characters.
skipping to change at page 5, line 7 skipping to change at page 5, line 7
For discussion of this document, please use the public-iri@w3.org For discussion of this document, please use the public-iri@w3.org
mailing list (publicly archived at mailing list (publicly archived at
http://lists.w3.org/Archives/Public/public-iri/). An issues list for http://lists.w3.org/Archives/Public/public-iri/). An issues list for
this document is maintained at this document is maintained at
http://www.w3.org/International/iri-edit#issues. For more http://www.w3.org/International/iri-edit#issues. For more
information on the topic of this document, please also see [W3CIRI] information on the topic of this document, please also see [W3CIRI]
and [Duerst01]. and [Duerst01].
1.2 Applicability 1.2 Applicability
IRIs are designed to be compatible with recent recommendations for IRIs are designed to be compatible with recommendations for new URI
new URI schemes [RFC2718]. The compatibility is provided by schemes [RFC2718]. The compatibility is provided by specifying a
specifying a well defined and deterministic mapping from the IRI well defined and deterministic mapping from the IRI character
character sequence to the functionally equivalent URI character sequence to the functionally equivalent URI character sequence.
sequence. Practical use of IRIs (or IRI references) in place of URIs Practical use of IRIs (or IRI references) in place of URIs (or URI
(or URI references) depends on the following conditions being met: references) depends on the following conditions being met:
a) The protocol or format element where IRIs are used should be a) The protocol or format element where IRIs are used should be
explicitly designated to be able to carry IRIs. That is, the explicitly designated to be able to carry IRIs. That is, the
intent is not to introduce IRIs into contexts that are not defined intent is not to introduce IRIs into contexts that are not defined
to accept them. For example, XML schema [XMLSchema] has an to accept them. For example, XML schema [XMLSchema] has an
explicit type "anyURI" that includes IRIs and IRI references. explicit type "anyURI" that includes IRIs and IRI references.
Therefore, IRIs and IRI references can be in attributes and Therefore, IRIs and IRI references can be in attributes and
elements of type "anyURI". On the other hand, in the HTTP elements of type "anyURI". On the other hand, in the HTTP
protocol [RFC2616], the Request URI is defined as an URI, which protocol [RFC2616], the Request URI is defined as an URI, which
means that direct use of IRIs is not allowed in HTTP requests. means that direct use of IRIs is not allowed in HTTP requests.
skipping to change at page 6, line 22 skipping to change at page 6, line 22
charset: The name of a parameter or attribute used to identify a charset: The name of a parameter or attribute used to identify a
character encoding. character encoding.
UCS: Universal Character Set; the coded character set defined by ISO/ UCS: Universal Character Set; the coded character set defined by ISO/
IEC 10646 [ISO10646] and the Unicode Standard [UNIV4]. IEC 10646 [ISO10646] and the Unicode Standard [UNIV4].
IRI reference: The term "IRI reference" denotes the common usage of IRI reference: The term "IRI reference" denotes the common usage of
an Internationalized Resource Identifier. An IRI reference may be an Internationalized Resource Identifier. An IRI reference may be
absolute or relative. However, the "IRI" that results from such a absolute or relative. However, the "IRI" that results from such a
reference only includes absolute IRIs; any relative IRIs are reference only includes absolute IRIs; any relative IRI references
resolved to their absolute form. Note that in [RFC2396], URIs did are resolved to their absolute form. Note that in [RFC2396], URIs
not include fragment identifiers, but in [RFCYYYY], fragment did not include fragment identifiers, but in [RFCYYYY], fragment
identifiers are part of URIs. identifiers are part of URIs.
running text: Human text (paragraphs, sentences, phrases) with syntax running text: Human text (paragraphs, sentences, phrases) with syntax
according to orthographic conventions of a natural language, as according to orthographic conventions of a natural language, as
opposed to syntax defined for ease of processing by machines opposed to syntax defined for ease of processing by machines
(markup, programming languages,...). (markup, programming languages,...).
protocol element: Any portion of a message which affects processing protocol element: Any portion of a message which affects processing
of that message by the protocol in question. of that message by the protocol in question.
presentation element: Presentation form corresponding to a protocol presentation element: Presentation form corresponding to a protocol
element, for example using a wider range of characters. element, for example using a wider range of characters.
create (an URI or IRI): With respect to URIs and IRIs, the word create (an URI or IRI): With respect to URIs and IRIs, the word
'create' is used for the initial creation. This may be the 'create' is used for the initial creation. This may be the
initial creation of a resource with a certain name, or the initial initial creation of a resource with a certain identifier, or the
exposition of a resource under a particular name. initial exposition of a resource under a particular identifier.
generate (an URI or IRI): With respect to URIs and IRIs, the word generate (an URI or IRI): With respect to URIs and IRIs, the word
'generate' is used when the IRI is generated by derivation from 'generate' is used when the IRI is generated by derivation from
other information. other information.
1.4 Notation 1.4 Notation
RFCs and Internet Drafts currently do not allow any characters RFCs and Internet Drafts currently do not allow any characters
outside the US-ASCII repertoire. Therefore, this document uses outside the US-ASCII repertoire. Therefore, this document uses
various special notations to denote such characters in examples. various special notations to denote such characters in examples.
skipping to change at page 8, line 8 skipping to change at page 8, line 8
2.1 Summary of IRI Syntax 2.1 Summary of IRI Syntax
IRIs are defined similarly to URIs in [RFCYYYY], but the class of IRIs are defined similarly to URIs in [RFCYYYY], but the class of
unreserved characters is extended by adding the characters of the UCS unreserved characters is extended by adding the characters of the UCS
(Universal Character Set, [ISO10646]) beyond U+007F, subject to the (Universal Character Set, [ISO10646]) beyond U+007F, subject to the
limitations given in the syntax rules below and in Section 6.1. limitations given in the syntax rules below and in Section 6.1.
Otherwise, the syntax and use of components and reserved characters Otherwise, the syntax and use of components and reserved characters
is the same as that in [RFCYYYY]. All the operations defined in is the same as that in [RFCYYYY]. All the operations defined in
[RFCYYYY], such as the resolution of relative URIs, can be applied to [RFCYYYY], such as the resolution of relative references, can be
IRIs by IRI-processing software in exactly the same way as this is applied to IRIs by IRI-processing software in exactly the same way as
done to URIs by URI-processing software. this is done to URIs by URI-processing software.
Characters outside the US-ASCII repertoire are not reserved and Characters outside the US-ASCII repertoire are not reserved and
therefore MUST NOT be used for syntactical purposes such as to therefore MUST NOT be used for syntactical purposes such as to
delimit components in newly defined schemes. As an example, it is delimit components in newly defined schemes. As an example, it is
not allowed to use U+00A2, CENT SIGN, as a delimiter in IRIs, because not allowed to use U+00A2, CENT SIGN, as a delimiter in IRIs, because
it is in the 'iunreserved' category, in the same way as it is not it is in the 'iunreserved' category, in the same way as it is not
possible to use '-' as a delimiter, because it is in the 'unreserved' possible to use '-' as a delimiter, because it is in the 'unreserved'
category in URIs. category in URIs.
2.2 ABNF for IRI References and IRIs 2.2 ABNF for IRI References and IRIs
skipping to change at page 8, line 48 skipping to change at page 8, line 48
of the non-terminals have been changed as follows: If the of the non-terminals have been changed as follows: If the
non-terminal contains 'URI', this has been changed to 'IRI'. non-terminal contains 'URI', this has been changed to 'IRI'.
Otherwise, an 'i' has been prefixed. Otherwise, an 'i' has been prefixed.
The following rules are different from [RFCYYYY]: The following rules are different from [RFCYYYY]:
IRI = scheme ":" ihier-part [ "?" iquery ] IRI = scheme ":" ihier-part [ "?" iquery ]
[ "#" ifragment ] [ "#" ifragment ]
ihier-part = "//" iauthority ipath-abempty ihier-part = "//" iauthority ipath-abempty
/ ipath-abs / ipath-absolute
/ ipath-rootless / ipath-rootless
/ ipath-empty / ipath-empty
IRI-reference = IRI / relative-IRI IRI-reference = IRI / irelative-ref
absolute-IRI = scheme ":" ihier-part [ "?" iquery ] absolute-IRI = scheme ":" ihier-part [ "?" iquery ]
relative-IRI = irelative-part [ "?" iquery ] [ "#" ifragment ] irelative-ref = irelative-part [ "?" iquery ] [ "#" ifragment ]
irelative-part = "//" iauthority ipath-abempty irelative-part = "//" iauthority ipath-abempty
/ ipath-abs / ipath-absolute
/ ipath-noscheme / ipath-noscheme
/ ipath-empty / ipath-empty
iauthority = [ iuserinfo "@" ] ihost [ ":" port ] iauthority = [ iuserinfo "@" ] ihost [ ":" port ]
iuserinfo = *( iunreserved / pct-encoded / sub-delims / ":" ) iuserinfo = *( iunreserved / pct-encoded / sub-delims / ":" )
ihost = IP-literal / IPv4address / ireg-name ihost = IP-literal / IPv4address / ireg-name
ireg-name = *( iunreserved / pct-encoded / sub-delims ) ireg-name = *( iunreserved / pct-encoded / sub-delims )
ipath = ipath-abempty ; begins with "/" or is empty ipath = ipath-abempty ; begins with "/" or is empty
/ ipath-abs ; begins with "/" but not "//" / ipath-absolute ; begins with "/" but not "//"
/ ipath-noscheme ; begins with a non-colon segment / ipath-noscheme ; begins with a non-colon segment
/ ipath-rootless ; begins with a segment / ipath-rootless ; begins with a segment
/ ipath-empty ; zero characters / ipath-empty ; zero characters
ipath-abempty = *( "/" isegment ) ipath-abempty = *( "/" isegment )
ipath-abs = "/" [ isegment-nz *( "/" isegment ) ] ipath-absolute = "/" [ isegment-nz *( "/" isegment ) ]
ipath-noscheme = isegment-nzc *( "/" isegment ) ipath-noscheme = isegment-nz-nc *( "/" isegment )
ipath-rootless = isegment-nz *( "/" isegment ) ipath-rootless = isegment-nz *( "/" isegment )
ipath-empty = 0<ipchar> ipath-empty = 0<ipchar>
isegment = *ipchar isegment = *ipchar
isegment-nz = 1*ipchar isegment-nz = 1*ipchar
isegment-nzc = 1*( iunreserved / pct-encoded / sub-delims isegment-nz-nc = 1*( iunreserved / pct-encoded / sub-delims
/ "@" ) / "@" )
; non-zero-length segment without any colon ":"
ipchar = iunreserved / pct-encoded / sub-delims / ":" ipchar = iunreserved / pct-encoded / sub-delims / ":"
/ "@" / "@"
iquery = *( ipchar / iprivate / "/" / "?" ) iquery = *( ipchar / iprivate / "/" / "?" )
ifragment = *( ipchar / "/" / "?" ) ifragment = *( ipchar / "/" / "?" )
iunreserved = ALPHA / DIGIT / "-" / "." / "_" / "~" / ucschar iunreserved = ALPHA / DIGIT / "-" / "." / "_" / "~" / ucschar
skipping to change at page 12, line 11 skipping to change at page 12, line 11
of any character encoding: Represent the IRI as a sequence of of any character encoding: Represent the IRI as a sequence of
characters from the UCS normalized according to Normalization characters from the UCS normalized according to Normalization
Form C (NFC, [UTR15]). Form C (NFC, [UTR15]).
Variant B) If the IRI is in some digital representation (e.g. an Variant B) If the IRI is in some digital representation (e.g. an
octet stream) in some known non-Unicode character encoding: octet stream) in some known non-Unicode character encoding:
Convert the IRI to a sequence of characters from the UCS Convert the IRI to a sequence of characters from the UCS
normalized according to NFC. normalized according to NFC.
Variant C) If the IRI is in an Unicode-based character encoding Variant C) If the IRI is in an Unicode-based character encoding
(for example UTF-8 or UTF-16): Do not normalize. Apply Step 2 (for example UTF-8 or UTF-16): Do not normalize (see Section
directly to the encoded Unicode character sequence. 5.3 for details). Apply Step 2 directly to the encoded Unicode
character sequence.
Step 2) For each character in 'ucschar' or 'iprivate', apply Steps Step 2) For each character in 'ucschar' or 'iprivate', apply Steps
2.1 through 2.3 below. 2.1 through 2.3 below.
2.1) Convert the character to a sequence of one or more octets 2.1) Convert the character to a sequence of one or more octets
using UTF-8 [RFC3629]. using UTF-8 [RFC3629].
2.2) Convert each octet to %HH, where HH is the hexadecimal 2.2) Convert each octet to %HH, where HH is the hexadecimal
notation of the octet value. Note that this is identical to notation of the octet value. Note that this is identical to
the percent-encoding mechanism in Section 2.1 of [RFCYYYY]. To the percent-encoding mechanism in Section 2.1 of [RFCYYYY]. To
reduce variability, the hexadecimal notation SHOULD use upper reduce variability, the hexadecimal notation SHOULD use upper
case letters. case letters.
2.3) Replace the original character by the resulting character 2.3) Replace the original character with the resulting character
sequence (i.e. a sequence of %HH triplets). sequence (i.e., a sequence of %HH triplets).
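As an illustration only, Steps 1 and 2 above can be sketched in Python
roughly as follows. The function name iri_to_uri and its normalize flag
are hypothetical and not part of this specification. The sketch assumes
that every character beyond U+007F falls under 'ucschar' or 'iprivate';
it does not check the ranges given in Section 2.2, and it leaves all
US-ASCII characters untouched.

```python
import unicodedata


def iri_to_uri(iri: str, normalize: bool = False) -> str:
    """Hypothetical helper sketching Steps 1 and 2 of the mapping.

    normalize=True corresponds to Variants A and B (NFC is applied);
    normalize=False corresponds to Variant C (no normalization).
    """
    if normalize:
        # Step 1, Variants A/B: normalize to Normalization Form C.
        iri = unicodedata.normalize("NFC", iri)

    pieces = []
    for ch in iri:
        if ord(ch) > 0x7F:
            # Step 2.1: convert the character to UTF-8 octets.
            octets = ch.encode("utf-8")
            # Steps 2.2 and 2.3: replace it with upper-case %HH triplets.
            pieces.append("".join("%{:02X}".format(b) for b in octets))
        else:
            # Assumption: US-ASCII characters are passed through as-is.
            pieces.append(ch)
    return "".join(pieces)


uri = iri_to_uri("http://r\u00e9sum\u00e9.example.org")
print(uri)
```

Because the sketch only rewrites characters beyond U+007F, applying it
to its own output changes nothing, which matches the idempotence
property of the mapping.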
The above mapping from IRIs to URIs produces URIs fully conforming to The above mapping from IRIs to URIs produces URIs fully conforming to
[RFCYYYY]. The mapping is also an identity transformation for URIs [RFCYYYY]. The mapping is also an identity transformation for URIs
and is idempotent -- applying the mapping a second time will not and is idempotent -- applying the mapping a second time will not
change anything. Every URI is by definition an IRI. change anything. Every URI is by definition an IRI.
Infrastructure accepting IRIs MAY convert the ireg-name component of Infrastructure accepting IRIs MAY convert the ireg-name component of
an IRI as follows (before Step 2 above) for schemes that are known to an IRI as follows (before Step 2 above) for schemes that are known to
use domain names in ireg-name, but where the scheme definition does use domain names in ireg-name, but where the scheme definition does
not allow percent-encoding for ireg-name: Replace the ireg-name part not allow percent-encoding for ireg-name: Replace the ireg-name part
skipping to change at page 13, line 7 skipping to change at page 13, line 8
the IRI the IRI
http://r&#xE9;sum&#xE9;.example.org may be converted to http://r&#xE9;sum&#xE9;.example.org may be converted to
http://xn--rsum-bpad.example.org instead of http://xn--rsum-bpad.example.org instead of
http://r%C3%A9sum%C3%A9.example.org. http://r%C3%A9sum%C3%A9.example.org.
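A minimal sketch of the ToASCII-based alternative, assuming Python's
built-in "idna" codec (which follows [RFC3490]) is an acceptable
stand-in for the ToASCII operation. The helper name ireg_name_to_ascii
is hypothetical, and the sketch handles only the ireg-name part, not a
complete IRI.

```python
def ireg_name_to_ascii(ireg_name: str) -> str:
    """Hypothetical helper: apply ToASCII to each label of a domain name.

    Assumption: Python's "idna" codec is used in place of a dedicated
    ToASCII implementation; it processes the name label by label.
    """
    return ireg_name.encode("idna").decode("ascii")


host = ireg_name_to_ascii("r\u00e9sum\u00e9.example.org")
print("http://" + host)
```

Labels that are already US-ASCII, such as "example" and "org", are
returned unchanged by this conversion.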
An IRI with a scheme that is known to use domain names in ireg-name, An IRI with a scheme that is known to use domain names in ireg-name,
but where the scheme definition does not allow percent-encoding for but where the scheme definition does not allow percent-encoding for
ireg-name, meets scheme-specific restrictions if either the ireg-name, meets scheme-specific restrictions if either the
straightforward conversion or the conversion using the ToASCII straightforward conversion or the conversion using the ToASCII
operation on ireg-name results in an URI that meets the operation on ireg-name results in an URI that meets the
scheme-specific restrictions. An IRI with a scheme that is known to scheme-specific restrictions. Such an IRI resolves to the URI
use domain names in ireg-name, but where the scheme definition does
not allow percent-encoding for ireg-name, resolves to the URI
obtained after converting the IRI including using the ToASCII obtained after converting the IRI including using the ToASCII
operation on ireg-name. Implementations do not need to do this operation on ireg-name. Implementations do not need to do this
conversion as long as they produce the same result. conversion as long as they produce the same result.
Note: The difference between Variants B and C in Step 1 (Variant B Note: The difference between Variants B and C in Step 1 (Variant B
normalizes with NFC, while Variant C does not normalize at all) normalizes with NFC, while Variant C does not normalize at all)
accounts for the fact that in many non-Unicode accounts for the fact that in many non-Unicode
character encodings, some text cannot be represented directly. character encodings, some text cannot be represented directly.
For example, Vietnam is natively written "Vi&#x1EC7;t Nam" For example, Vietnam is natively written "Vi&#x1EC7;t Nam"
(containing a LATIN SMALL LETTER E WITH CIRCUMFLEX AND DOT BELOW) (containing a LATIN SMALL LETTER E WITH CIRCUMFLEX AND DOT BELOW)
skipping to change at page 18, line 23 skipping to change at page 18, line 23
LEFT-TO-RIGHT EMBEDDING (LRE), and followed by U+202C, POP LEFT-TO-RIGHT EMBEDDING (LRE), and followed by U+202C, POP
DIRECTIONAL FORMATTING (PDF). Setting the embedding direction can DIRECTIONAL FORMATTING (PDF). Setting the embedding direction can
also be done in a higher-level protocol (e.g. the dir='ltr' also be done in a higher-level protocol (e.g. the dir='ltr'
attribute in HTML). attribute in HTML).
There is no requirement to actually use the above embedding if the There is no requirement to actually use the above embedding if the
display is still the same without the embedding. For example, a display is still the same without the embedding. For example, a
bidirectional IRI in a text with left-to-right base directionality bidirectional IRI in a text with left-to-right base directionality
(such as used for English or Cyrillic) that is preceded and followed (such as used for English or Cyrillic) that is preceded and followed
by whitespace and strong left-to-right characters does not need an by whitespace and strong left-to-right characters does not need an
embedding. Also, a bidirectional relative IRI that only contains embedding. Also, a bidirectional relative IRI reference that only
strong right-to-left characters and weak characters and that starts contains strong right-to-left characters and weak characters and that
and ends with a strong right-to-left character and appears in a text starts and ends with a strong right-to-left character and appears in
with right-to-left base directionality (such as used for Arabic or a text with right-to-left base directionality (such as used for
Hebrew) and is preceded and followed by whitespace and strong Arabic or Hebrew) and is preceded and followed by whitespace and
characters does not need an embedding. strong characters does not need an embedding.
In some other cases, using U+200E, LEFT-TO-RIGHT MARK (LRM) may be In some other cases, using U+200E, LEFT-TO-RIGHT MARK (LRM) may be
sufficient to force the correct display behavior. However, the sufficient to force the correct display behavior. However, the
details of the Unicode Bidirectional algorithm are not always easy to details of the Unicode Bidirectional algorithm are not always easy to
understand. Implementers are strongly advised to err on the side of understand. Implementers are strongly advised to err on the side of
caution and to use embedding in all cases where they are not caution and to use embedding in all cases where they are not
completely sure that the display behavior is unaffected without the completely sure that the display behavior is unaffected without the
embedding. embedding.
The Unicode Bidirectional Algorithm ([UNI9], Section 4.3) permits The Unicode Bidirectional Algorithm ([UNI9], Section 4.3) permits
skipping to change at page 19, line 17 skipping to change at page 19, line 17
The Unicode Bidirectional Algorithm is designed mainly for running The Unicode Bidirectional Algorithm is designed mainly for running
text. To make sure that it does not affect the rendering of text. To make sure that it does not affect the rendering of
bidirectional IRIs too much, some restrictions on bidirectional IRIs bidirectional IRIs too much, some restrictions on bidirectional IRIs
are necessary. These restrictions are given in terms of delimiters are necessary. These restrictions are given in terms of delimiters
(structural characters, mostly punctuation such as '@', '.', ':', (structural characters, mostly punctuation such as '@', '.', ':',
'/') and components (usually consisting mostly of letters and '/') and components (usually consisting mostly of letters and
digits). digits).
The following syntax rules from Section 2.2 correspond to components The following syntax rules from Section 2.2 correspond to components
for the purpose of Bidi behavior: iuserinfo, ireg-name, isegment, for the purpose of Bidi behavior: iuserinfo, ireg-name, isegment,
isegment-nz, isegment-nzc, iquery, and ifragment. isegment-nz, isegment-nz-nc, iquery, and ifragment.
Specifications that define the syntax of any of the above components Specifications that define the syntax of any of the above components
MAY divide them further and define smaller parts to be components MAY divide them further and define smaller parts to be components
according to this document. As an example, the restrictions of according to this document. As an example, the restrictions of
[RFC3490] on bidirectional domain names correspond to treating each [RFC3490] on bidirectional domain names correspond to treating each
label of a domain name as a component for those schemes where label of a domain name as a component for those schemes where
ireg-name is a domain name. Even where the components are not ireg-name is a domain name. Even where the components are not
defined formally, it may be helpful to think about some syntax in defined formally, it may be helpful to think about some syntax in
terms of components and to apply the relevant restrictions. For terms of components and to apply the relevant restrictions. For
example, for the usual name/value syntax in query parts, it is example, for the usual name/value syntax in query parts, it is
skipping to change at page 21, line 50 skipping to change at page 21, line 50
logical representation: http://ab.cd.ef/GH1/2IJ/KL.html logical representation: http://ab.cd.ef/GH1/2IJ/KL.html
visual representation: http://ab.cd.ef/LK/JI1/2HG.html visual representation: http://ab.cd.ef/LK/JI1/2HG.html
The sequence '1/2' is interpreted by the bidi algorithm as a The sequence '1/2' is interpreted by the bidi algorithm as a
fraction, fragmenting the components and leading to confusion. There fraction, fragmenting the components and leading to confusion. There
are other characters that are interpreted in a special way close to are other characters that are interpreted in a special way close to
numbers, in particular '+', '-', '#', '$', '%', ',', '.', and ':'. numbers, in particular '+', '-', '#', '$', '%', ',', '.', and ':'.
Example 9 (not allowed): The numbers in the previous example are Example 9 (not allowed): The numbers in the previous example are
percent-encoded: percent-encoded:
logical representation: http://ab.cd.ef/GH%31/%32IJ/KL.html logical representation: http://ab.cd.ef/GH%31/%32IJ/KL.html
visual representation (Hebrew): http://ab.cd.ef/LK/JI%32/%31HG.html visual representation (Hebrew): http://ab.cd.ef/%31HG/LK/JI%32.html
visual representation (Arabic): http://ab.cd.ef/LK/JI32%/31%HG.html visual representation (Arabic): http://ab.cd.ef/31%HG/%LK/JI32.html
Depending on whether the upper-case letters represent Arabic or Depending on whether the upper-case letters represent Arabic or
Hebrew, the visual representation is different. Hebrew, the visual representation is different.
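The interaction described in Examples 8 and 9 can be explored by looking up
the bidirectional class that the Unicode character database assigns to each
character. The following is only an illustrative sketch, not part of the
draft: it uses Python's standard unicodedata module, and the names
NUMERIC_CONTEXT_CHARS and bidi_classes are hypothetical helpers introduced
here. The separators listed above carry weak, number-related classes, while
Hebrew and Arabic letters fall into different strong classes (R and AL),
which is most likely why the two visual representations in Example 9 differ.

```python
import unicodedata

# Characters the text singles out as being treated specially next to numbers.
NUMERIC_CONTEXT_CHARS = "/+-#$%,.:"

# Weak bidi classes tied to numbers: European separator, European
# terminator, common separator.
WEAK_NUMERIC_CLASSES = {"ES", "ET", "CS"}


def bidi_classes(text):
    """Return (character, bidi class) pairs for every character in text."""
    return [(ch, unicodedata.bidirectional(ch)) for ch in text]


if __name__ == "__main__":
    for ch, cls in bidi_classes(NUMERIC_CONTEXT_CHARS + "1a\u05d0\u0627"):
        print(f"U+{ord(ch):04X} {cls}")
```

Running the script prints one class per character; a component that mixes
RTL letters with characters from the weak classes is a candidate for the
kind of reordering shown in the examples.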
Example 10 (allowed, but not recommended): Example 10 (allowed, but not recommended):
logical representation: http://ab.CDEFGH.123/kl/mn/op.html logical representation: http://ab.CDEFGH.123/kl/mn/op.html
visual representation: http://ab.123.HGFEDC/kl/mn/op.html visual representation: http://ab.123.HGFEDC/kl/mn/op.html
Components consisting of only numbers are allowed (it would be rather Components consisting of only numbers are allowed (it would be rather
difficult to prohibit them), but may interact with adjacent RTL difficult to prohibit them), but may interact with adjacent RTL
components in ways that are not easy to predict. components in ways that are not easy to predict.
skipping to change at page 23, line 19 skipping to change at page 23, line 19
For actual resolution, differences in percent-encoding (except for For actual resolution, differences in percent-encoding (except for
the percent-encoding of reserved characters) MUST always result in the percent-encoding of reserved characters) MUST always result in
the same resource. For example, http://example.org/~user, the same resource. For example, http://example.org/~user,
http://example.org/%7euser and http://example.org/%7Euser must http://example.org/%7euser and http://example.org/%7Euser must
resolve to the same resource. resolve to the same resource.
If this kind of equivalence is to be tested, the percent-encoding of If this kind of equivalence is to be tested, the percent-encoding of
both IRIs to be compared has to be aligned, for example by converting both IRIs to be compared has to be aligned, for example by converting
both IRIs to URIs (see Section 3.1), eliminating escape differences both IRIs to URIs (see Section 3.1), eliminating escape differences
in the resulting URIs, and making sure that the case of the in the resulting URIs, and making sure that the case of the
hexadecimal characters in the percent-encodeing is always the same hexadecimal characters in the percent-encoding is always the same
(preferably upper case). If the IRI is to be passed to another (preferably upper case). If the IRI is to be passed to another
application, or used further in some other way, its original form application, or used further in some other way, its original form
MUST be preserved; the conversion described here should be performed MUST be preserved; the conversion described here should be performed
only for the purpose of local comparison. only for the purpose of local comparison.
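The alignment steps in the paragraph above can be sketched as follows. This
is a minimal illustration under stated assumptions, not the draft's
normative procedure: iri_to_uri is a simplified stand-in for the Section 3.1
mapping (it percent-encodes every non-ASCII character as UTF-8 and ignores
any special handling of ireg-name), the unreserved set is assumed to be the
one from the generic URI syntax, and comparison_key is a hypothetical name.
The key is meant for local comparison only; the original IRI is left
untouched.

```python
import re

# Unreserved characters of the generic URI syntax (assumed: ALPHA, DIGIT,
# "-", ".", "_", "~").
UNRESERVED = set(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789-._~"
)

_PCT = re.compile(r"%([0-9A-Fa-f]{2})")


def iri_to_uri(iri):
    """Simplified IRI-to-URI mapping: UTF-8 percent-encode non-ASCII."""
    return "".join(
        ch if ord(ch) < 0x80
        else "".join(f"%{octet:02X}" for octet in ch.encode("utf-8"))
        for ch in iri
    )


def _align_escape(match):
    """Decode escapes of unreserved characters; upper-case all others."""
    ch = chr(int(match.group(1), 16))
    if ch in UNRESERVED:
        return ch
    return "%" + match.group(1).upper()


def comparison_key(iri):
    """Return a string usable for local equivalence tests only."""
    return _PCT.sub(_align_escape, iri_to_uri(iri))
```

With this sketch, http://example.org/~user, http://example.org/%7euser and
http://example.org/%7Euser all yield the same key, while an escape of a
reserved character such as %2f stays encoded and is merely upper-cased.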
Additional, similar equivalences are possible based on knowledge Additional, similar equivalences are possible based on knowledge
about the generic URI/IRI syntax, such as the fact that the scheme about the generic URI/IRI syntax, such as the fact that the scheme
part is case-insensitive. part is case-insensitive.
5.3 Normalization 5.3 Normalization
skipping to change at page 24, line 5 skipping to change at page 24, line 5
UCS-based character encoding. In these cases, NFC or a normalizing UCS-based character encoding. In these cases, NFC or a normalizing
transcoder using NFC MUST be used for interoperability. To avoid transcoder using NFC MUST be used for interoperability. To avoid
false negatives and problems with transcoding, IRIs SHOULD be created false negatives and problems with transcoding, IRIs SHOULD be created
using NFC. Using NFKC may avoid even more problems, for example by using NFC. Using NFKC may avoid even more problems, for example by
choosing half-width Latin letters instead of full-width, and choosing half-width Latin letters instead of full-width, and
full-width Katakana instead of half-width. full-width Katakana instead of half-width.
As an example, http://www.example.org/r&#xE9;sum&#xE9;.html (in XML As an example, http://www.example.org/r&#xE9;sum&#xE9;.html (in XML
Notation) is in NFC. On the other hand, Notation) is in NFC. On the other hand,
http://www.example.org/re&#x301;sume&#x301;.html is not in NFC. The http://www.example.org/re&#x301;sume&#x301;.html is not in NFC. The
former uses precombined e-acute characters, the later uses 'e' former uses precombined e-acute characters, the latter uses 'e'
characters followed by combining acute accents. Both usages are characters followed by combining acute accents. Both usages are
defined to be canonically equivalent in [UNIV4]. defined to be canonically equivalent in [UNIV4].
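The two spellings of the example can be checked with Python's standard
unicodedata module. This is an illustrative sketch only; is_nfc is a
hypothetical helper, and the results depend on the Unicode version shipped
with the Python interpreter rather than on [UNIV4] specifically.

```python
import unicodedata

# Precomposed e-acute (U+00E9) versus 'e' followed by combining acute (U+0301).
PRECOMPOSED = "http://www.example.org/r\u00e9sum\u00e9.html"
DECOMPOSED = "http://www.example.org/re\u0301sume\u0301.html"


def is_nfc(text):
    """True if text is unchanged by NFC normalization."""
    return unicodedata.normalize("NFC", text) == text


def to_nfc(text):
    """Normalize text to NFC, as recommended when an IRI is created."""
    return unicodedata.normalize("NFC", text)


def to_nfkc(text):
    """Normalize text to NFKC, which also folds compatibility variants."""
    return unicodedata.normalize("NFKC", text)
```

Here to_nfc(DECOMPOSED) equals PRECOMPOSED, and to_nfkc maps a full-width
Latin letter such as U+FF21 to its ordinary counterpart and a half-width
Katakana letter such as U+FF76 to its full-width counterpart.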
Note: Because it is unknown how a particular field is being treated Note: Because it is unknown how a particular field is being treated
with respect to text normalization, it would be inappropriate to with respect to text normalization, it would be inappropriate to
allow third parties to normalize an IRI arbitrarily. This does allow third parties to normalize an IRI arbitrarily. This does
not contradict the recommendation that when a resource is created, not contradict the recommendation that when a resource is created,
its IRI should be as normalized as possible (i.e. NFC or even its IRI should be as normalized as possible (i.e. NFC or even
NFKC). This is similar to the upper-case/lower-case problems in NFKC). This is similar to the upper-case/lower-case problems in
URIs. Some parts of a URI are case-insensitive (domain name). URIs. Some parts of a URI are case-insensitive (domain name).
skipping to change at page 24, line 49 skipping to change at page 24, line 49
- Always use uppercase A-through-F characters when percent-encoding. - Always use uppercase A-through-F characters when percent-encoding.
- For those schemes where ireg-name is a domain name, always provide - For those schemes where ireg-name is a domain name, always provide
the individual labels, in the form produced when applying nameprep the individual labels, in the form produced when applying nameprep
[RFC3491]. This in particular includes using lowercase characters [RFC3491]. This in particular includes using lowercase characters
rather than uppercase characters where applicable. Also, always rather than uppercase characters where applicable. Also, always
use US-ASCII '.' as a separator. use US-ASCII '.' as a separator.
- Where possible, provide IRI components in NFKC or NFC. - Where possible, provide IRI components in NFKC or NFC.
- Prevent /./ and /../ from appearing in non-relative URI paths. - Prevent /./ and /../ from appearing in IRI paths.
- For schemes that define an empty path to be equivalent to a path - For schemes that define an empty path to be equivalent to a path
of "/", use "/". of "/", use "/".
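Some of the guidelines in the list above lend themselves to a short sketch
for use by whoever creates an IRI (not by third parties, who should not
normalize arbitrarily). The function tidy_iri is hypothetical, covers only
three of the items (NFC, upper-case hexadecimal digits, and "/" for an empty
path), and assumes that http is a scheme for which an empty path is
equivalent to "/". Nameprep processing of ireg-name labels and the removal
of /./ and /../ are deliberately left out.

```python
import re
import unicodedata
from urllib.parse import urlsplit, urlunsplit

# Schemes assumed, for this sketch, to treat an empty path like "/".
EMPTY_PATH_IS_ROOT = {"http"}

_PCT = re.compile(r"%[0-9A-Fa-f]{2}")


def tidy_iri(iri):
    """Apply a subset of the creation-time guidelines to an IRI."""
    # Provide the IRI in NFC.
    iri = unicodedata.normalize("NFC", iri)
    # Always use upper-case A-through-F in percent-encodings.
    iri = _PCT.sub(lambda m: m.group(0).upper(), iri)
    # Use "/" where the scheme defines an empty path as equivalent to it.
    parts = urlsplit(iri)
    if parts.scheme in EMPTY_PATH_IS_ROOT and parts.netloc and parts.path == "":
        parts = parts._replace(path="/")
    return urlunsplit(parts)
```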
6. Use of IRIs 6. Use of IRIs
6.1 Limitations on UCS Characters Allowed in IRIs 6.1 Limitations on UCS Characters Allowed in IRIs
This section discusses limitations on characters and character This section discusses limitations on characters and character
sequences usable for IRIs beyond those given in Section 2.2 and sequences usable for IRIs beyond those given in Section 2.2 and
skipping to change at page 27, line 9 skipping to change at page 27, line 9
http://www.example.org/r&#xE9;sum&#xE9;.html (&#xE9; stands for the http://www.example.org/r&#xE9;sum&#xE9;.html (&#xE9; stands for the
e-acute character, and %C3%A9 is the UTF-8 encoded and e-acute character, and %C3%A9 is the UTF-8 encoded and
percent-encoded representation of that character). On the other percent-encoded representation of that character). On the other
hand, for a document with a URI of hand, for a document with a URI of
http://www.example.org/r%E9sum%E9.html, the percent-encoding octets http://www.example.org/r%E9sum%E9.html, the percent-encoding octets
cannot be converted to actual characters in an IRI, because the cannot be converted to actual characters in an IRI, because the
percent-encoding is not based on UTF-8. percent-encoding is not based on UTF-8.
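The distinction between the two documents can be illustrated with a small
sketch. It is an assumption-laden simplification, not the draft's conversion
procedure: uri_to_iri_sketch is a hypothetical name, a run of
percent-encoded octets is converted only if the whole run is valid UTF-8,
escapes of US-ASCII characters are left exactly as written, and any further
restrictions the specification places on which characters may be converted
are ignored.

```python
import re

# One or more consecutive percent-encoded octets.
_PCT_RUN = re.compile(r"(?:%[0-9A-Fa-f]{2})+")


def _decode_run(match):
    """Turn a run of escapes into characters where it is valid UTF-8."""
    run = match.group(0)
    octets = bytes.fromhex(run.replace("%", ""))
    try:
        text = octets.decode("utf-8")  # strict decoding
    except UnicodeDecodeError:
        return run  # not UTF-8 based: keep the escapes untouched
    out = []
    pos = 0  # number of octets consumed so far
    for ch in text:
        width = len(ch.encode("utf-8"))
        if width == 1:
            # US-ASCII character: keep the original three-character escape.
            out.append(run[3 * pos:3 * pos + 3])
        else:
            out.append(ch)
        pos += width
    return "".join(out)


def uri_to_iri_sketch(uri):
    """Convert UTF-8 based percent-encodings of non-ASCII characters."""
    return _PCT_RUN.sub(_decode_run, uri)
```

Applied to http://www.example.org/r%C3%A9sum%C3%A9.html the sketch yields
the IRI with two e-acute characters, whereas
http://www.example.org/r%E9sum%E9.html is returned unchanged because the
single octet E9 is not a valid UTF-8 sequence.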
This means that for most URI schemes, there is no need to upgrade This means that for most URI schemes, there is no need to upgrade
their scheme definition in order for them to work with IRIs. The their scheme definition in order for them to work with IRIs. The
main case where upgrading a scheme definition may make sense is when main case where upgrading a scheme definition makes sense is when a
a scheme definition is limited to the use of US-ASCII characters with scheme definition, or a particular component of a scheme, is strictly
no provision to include non-ASCII characters/octets but a desire to limited to the use of US-ASCII characters with no provision to
include such characters, or only with provisions that are highly include non-ASCII characters/octets via percent-encoding, or if a
scheme-specific. An example of such a scheme might be the mailto: scheme definition currently uses highly scheme-specific provisions
scheme [RFC2368]. for the encoding of non-ASCII characters. An example of such a
scheme might be the mailto: scheme [RFC2368].
This specification does not upgrade any scheme specifications in any This specification does not upgrade any scheme specifications in any
way, this has to be done separately. Also, it should be noted that way, this has to be done separately. Also, it should be noted that
there is no such thing as an "IRI scheme"; all IRIs use URI schemes, there is no such thing as an "IRI scheme"; all IRIs use URI schemes,
and all URI schemes can be used with IRIs, even though in some cases and all URI schemes can be used with IRIs, even though in some cases
only by using URIs directly as IRIs, without any conversion. only by using URIs directly as IRIs, without any conversion.
URI schemes can impose restrictions on the syntax of scheme-specific
URIs, i.e., URIs that are admissible under the generic URI syntax
[RFCYYYY] may not be admissible due to narrower syntactic constraints
imposed by a URI scheme specification. URI scheme definitions cannot
broaden the syntactic restrictions of the generic URI syntax,
otherwise it would be possible to generate URIs that satisfied the
scheme-specific syntactic constraints without satisfying the
syntactic constraints of the generic URI syntax. However, additional
syntactic constraints imposed by URI scheme specifications are
applicable to IRIs, since the corresponding URI resulting from the
mapping defined in Section 3.1 MUST be a valid URI under the
syntactic restrictions of generic URI syntax and any narrower
restrictions imposed by the corresponding URI scheme specification.
The requirement for the use of UTF-8 applies to all parts of a URI The requirement for the use of UTF-8 applies to all parts of a URI
(with the potential exception of the ireg-name part, see Section (with the potential exception of the ireg-name part, see Section
3.1). However, it is possible that the capability of IRIs to 3.1). However, it is possible that the capability of IRIs to
represent a wide range of characters directly is used just in some represent a wide range of characters directly is used just in some
parts of the IRI (or IRI reference). The other parts of the IRI may parts of the IRI (or IRI reference). The other parts of the IRI may
only contain US-ASCII characters, or they may not be based on UTF-8. only contain US-ASCII characters, or they may not be based on UTF-8.
They may be based on another character encoding, or they may directly They may be based on another character encoding, or they may directly
encode raw binary data (see also [RFC2397]). encode raw binary data (see also [RFC2397]).
For example, it is possible to have a URI reference of For example, it is possible to have a URI reference of
skipping to change at page 27, line 44 skipping to change at page 28, line 11
the fragment identifier is encoded in UTF-8 according to [XPointer]. the fragment identifier is encoded in UTF-8 according to [XPointer].
The IRI corresponding to the above URI would be (in XML notation) The IRI corresponding to the above URI would be (in XML notation)
http://www.example.org/r%E9sum%E9.xml#r&#xE9;sum&#xE9;. http://www.example.org/r%E9sum%E9.xml#r&#xE9;sum&#xE9;.
Similar considerations apply to query parts. The functionality of Similar considerations apply to query parts. The functionality of
IRIs (namely to be able to include non-ASCII characters) can only be IRIs (namely to be able to include non-ASCII characters) can only be
used if the query part is encoded in UTF-8. used if the query part is encoded in UTF-8.
6.5 Relative IRI References 6.5 Relative IRI References
Processing of relative forms of IRIs against a base is handled Processing of relative IRI references against a base is handled
straightforwardly; the algorithms of [RFCYYYY] can be applied straightforwardly; the algorithms of [RFCYYYY] can be applied
directly, treating the characters additionally allowed in IRIs in the directly, treating the characters additionally allowed in IRI
same way as unreserved characters in URIs. references in the same way as unreserved characters in URI
references.
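As a rough illustration of the paragraph above, Python's standard
urllib.parse.urljoin can be applied directly to strings containing non-ASCII
characters. This is a sketch under the assumption that urljoin's resolution
behaviour matches the algorithms of [RFCYYYY] closely enough for these
simple cases; the base IRI and the helper name resolve are made up for the
example.

```python
from urllib.parse import urljoin

# Hypothetical base IRI used only for this illustration.
BASE = "http://www.example.org/docs/index.html"


def resolve(reference, base=BASE):
    """Resolve a relative IRI reference against a base IRI."""
    # Non-ASCII characters pass through unchanged, much like unreserved
    # characters would in a URI reference.
    return urljoin(base, reference)


if __name__ == "__main__":
    for ref in ("r\u00e9sum\u00e9.html", "../images/\u56f3.png", "#r\u00e9sum\u00e9"):
        print(resolve(ref))
```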
7. URI/IRI Processing Guidelines (informative) 7. URI/IRI Processing Guidelines (informative)
This informative section provides guidelines for supporting IRIs in This informative section provides guidelines for supporting IRIs in
the same software components and operations that currently process the same software components and operations that currently process
URIs: software interfaces that handle URIs, software that allows URIs: software interfaces that handle URIs, software that allows
users to enter URIs, software that creates or generates URIs, users to enter URIs, software that creates or generates URIs,
software that displays URIs, formats and protocols that transport software that displays URIs, formats and protocols that transport
URIs, and software that interprets URIs. These may all require more URIs, and software that interprets URIs. These may all require more
or less modification before functioning properly with IRIs. The or less modification before functioning properly with IRIs. The
skipping to change at page 35, line 49 skipping to change at page 36, line 17
Profile for Internationalized Domain Names (IDN)", RFC Profile for Internationalized Domain Names (IDN)", RFC
3491, March 2003. 3491, March 2003.
[RFC3629] Yergeau, F., "UTF-8, a transformation format of ISO [RFC3629] Yergeau, F., "UTF-8, a transformation format of ISO
10646", STD 63, RFC 3629, November 2003. 10646", STD 63, RFC 3629, November 2003.
[RFCYYYY] Berners-Lee, T., Fielding, R. and L. Masinter, "Uniform [RFCYYYY] Berners-Lee, T., Fielding, R. and L. Masinter, "Uniform
Resource Identifier (URI): Generic Syntax (Note to the RFC Resource Identifier (URI): Generic Syntax (Note to the RFC
Editor: Please update this reference with the RFC Editor: Please update this reference with the RFC
resulting from draft-fielding-uri-rfc2396bis-xx.txt, and resulting from draft-fielding-uri-rfc2396bis-xx.txt, and
remove this Note)", draft-fielding-uri-rfc2396bis-05.txt remove this Note)", draft-fielding-uri-rfc2396bis-07.txt
(work in progress), April 2004. (work in progress), April 2004.
[UNI9] Davis, M., "The Bidirectional Algorithm", Unicode Standard [UNI9] Davis, M., "The Bidirectional Algorithm", Unicode Standard
Annex #9, March 2004, Annex #9, March 2004,
<http://www.unicode.org/reports/tr9/tr9-13.html>. <http://www.unicode.org/reports/tr9/tr9-13.html>.
[UNIV4] The Unicode Consortium, "The Unicode Standard, Version [UNIV4] The Unicode Consortium, "The Unicode Standard, Version
4.0.1, defined by: The Unicode Standard, Version 4.0 4.0.1, defined by: The Unicode Standard, Version 4.0
(Reading, MA, Addison-Wesley, 2003. ISBN 0-321-18578-1), (Reading, MA, Addison-Wesley, 2003. ISBN 0-321-18578-1),
as amended by Unicode 4.0.1 as amended by Unicode 4.0.1
(http://www.unicode.org/versions/Unicode4.0.1/)", March (http://www.unicode.org/versions/Unicode4.0.1/)", March
2004. 2004.
[UTR15] Davis, M. and M. Duerst, "Unicode Normalization Forms", [UTR15] Davis, M. and M. Duerst, "Unicode Normalization Forms",
Unicode Standard Annex #15, April 2003, Unicode Standard Annex #15, April 2003,
<http://www.unicode.org/unicode/reports/tr15/tr15-23.html>. <http://www.unicode.org/unicode/reports/tr15/
tr15-23.html>.
11.2 Non-normative References 11.2 Non-normative References
[BidiEx] "Examples of bidirectional IRIs", [BidiEx] "Examples of bidirectional IRIs",
<http://www.w3.org/International/iri-edit/BidiExamples>. <http://www.w3.org/International/iri-edit/BidiExamples>.
[CharMod] Duerst, M., Yergeau, F., Ishida, R., Wolf, M. and T. [CharMod] Duerst, M., Yergeau, F., Ishida, R., Wolf, M. and T.
Texin, "Character Model for the World Wide Web", World Texin, "Character Model for the World Wide Web", World
Wide Web Consortium Working Draft, February 2004, <http:// Wide Web Consortium Working Draft, February 2004,
www.w3.org/TR/charmod>. <http://www.w3.org/TR/charmod>.
[Duerst01] [Duerst01]
Duerst, M., "Internationalized Resource Identifiers: From Duerst, M., "Internationalized Resource Identifiers: From
Specification to Testing", Proc. 19th International Specification to Testing", Proc. 19th International
Unicode Conference, San Jose, September 2001, Unicode Conference, San Jose, September 2001,
<http://www.w3.org/2001/Talks/0912-IUC-IRI/paper.html>. <http://www.w3.org/2001/Talks/0912-IUC-IRI/paper.html>.
[Duerst97] [Duerst97]
Duerst, M., "The Properties and Promises of UTF-8", Proc. Duerst, M., "The Properties and Promises of UTF-8", Proc.
11th International Unicode Conference, San Jose, 11th International Unicode Conference, San Jose,
 End of changes. 

This html diff was produced by rfcdiff 1.12, available from http://www.levkowetz.com/ietf/tools/rfcdiff/