draft-duerst-iri-08.txt   draft-duerst-iri-09.txt 
Network Working Group M. Duerst Network Working Group M. Duerst
Internet-Draft W3C Internet-Draft W3C
Expires: November 26, 2004 M. Suignard Expires: January 17, 2005 M. Suignard
Microsoft Corporation Microsoft Corporation
May 28, 2004 July 19, 2004
Internationalized Resource Identifiers (IRIs) Internationalized Resource Identifiers (IRIs)
draft-duerst-iri-08 draft-duerst-iri-09
Status of this Memo Status of this Memo
By submitting this Internet-Draft, I certify that any applicable By submitting this Internet-Draft, I certify that any applicable
patent or other IPR claims of which I am aware have been disclosed, patent or other IPR claims of which I am aware have been disclosed,
and any of which I become aware will be disclosed, in accordance with and any of which I become aware will be disclosed, in accordance with
RFC 3668. RFC 3668.
Internet-Drafts are working documents of the Internet Engineering Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF), its areas, and its working groups. Note that Task Force (IETF), its areas, and its working groups. Note that
skipping to change at page 1, line 35 skipping to change at page 1, line 35
and may be updated, replaced, or obsoleted by other documents at any and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress." material or to cite them other than as "work in progress."
The list of current Internet-Drafts can be accessed at The list of current Internet-Drafts can be accessed at
http://www.ietf.org/ietf/1id-abstracts.txt. http://www.ietf.org/ietf/1id-abstracts.txt.
The list of Internet-Draft Shadow Directories can be accessed at The list of Internet-Draft Shadow Directories can be accessed at
http://www.ietf.org/shadow.html. http://www.ietf.org/shadow.html.
This Internet-Draft will expire on November 26, 2004. This Internet-Draft will expire on January 17, 2005.
Copyright Notice Copyright Notice
Copyright (C) The Internet Society (2004). All Rights Reserved. Copyright (C) The Internet Society (2004). All Rights Reserved.
Abstract Abstract
This document defines a new protocol element, the Internationalized This document defines a new protocol element, the Internationalized
Resource Identifier (IRI), as a complement to the Uniform Resource Resource Identifier (IRI), as a complement to the Uniform Resource
Identifier (URI). An IRI is a sequence of characters from the Identifier (URI). An IRI is a sequence of characters from the
skipping to change at page 2, line 41 skipping to change at page 2, line 41
5.2 Conversion to URIs . . . . . . . . . . . . . . . . . . . . 23 5.2 Conversion to URIs . . . . . . . . . . . . . . . . . . . . 23
5.3 Normalization . . . . . . . . . . . . . . . . . . . . . . 23 5.3 Normalization . . . . . . . . . . . . . . . . . . . . . . 23
5.4 Preferred Forms . . . . . . . . . . . . . . . . . . . . . 24 5.4 Preferred Forms . . . . . . . . . . . . . . . . . . . . . 24
6. Use of IRIs . . . . . . . . . . . . . . . . . . . . . . . . . 25 6. Use of IRIs . . . . . . . . . . . . . . . . . . . . . . . . . 25
6.1 Limitations on UCS Characters Allowed in IRIs . . . . . . 25 6.1 Limitations on UCS Characters Allowed in IRIs . . . . . . 25
6.2 Software Interfaces and Protocols . . . . . . . . . . . . 25 6.2 Software Interfaces and Protocols . . . . . . . . . . . . 25
6.3 Format of URIs and IRIs in Documents and Protocols . . . . 26 6.3 Format of URIs and IRIs in Documents and Protocols . . . . 26
6.4 Use of UTF-8 for Encoding Original Characters . . . . . . 26 6.4 Use of UTF-8 for Encoding Original Characters . . . . . . 26
6.5 Relative IRI References . . . . . . . . . . . . . . . . . 27 6.5 Relative IRI References . . . . . . . . . . . . . . . . . 27
7. URI/IRI Processing Guidelines (informative) . . . . . . . . . 27 7. URI/IRI Processing Guidelines (informative) . . . . . . . . . 27
7.1 URI/IRI Software Interfaces . . . . . . . . . . . . . . . 27 7.1 URI/IRI Software Interfaces . . . . . . . . . . . . . . . 28
7.2 URI/IRI Entry . . . . . . . . . . . . . . . . . . . . . . 28 7.2 URI/IRI Entry . . . . . . . . . . . . . . . . . . . . . . 28
7.3 URI/IRI Transfer Between Applications . . . . . . . . . . 29 7.3 URI/IRI Transfer Between Applications . . . . . . . . . . 29
7.4 URI/IRI Generation . . . . . . . . . . . . . . . . . . . . 29 7.4 URI/IRI Generation . . . . . . . . . . . . . . . . . . . . 29
7.5 URI/IRI Selection . . . . . . . . . . . . . . . . . . . . 30 7.5 URI/IRI Selection . . . . . . . . . . . . . . . . . . . . 30
7.6 Display of URIs/IRIs . . . . . . . . . . . . . . . . . . . 30 7.6 Display of URIs/IRIs . . . . . . . . . . . . . . . . . . . 31
7.7 Interpretation of URIs and IRIs . . . . . . . . . . . . . 31 7.7 Interpretation of URIs and IRIs . . . . . . . . . . . . . 31
7.8 Upgrading Strategy . . . . . . . . . . . . . . . . . . . . 31 7.8 Upgrading Strategy . . . . . . . . . . . . . . . . . . . . 32
8. Security Considerations . . . . . . . . . . . . . . . . . . . 32 8. Security Considerations . . . . . . . . . . . . . . . . . . . 33
9. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 34 9. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 34
10. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . 34 10. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . 34
11. References . . . . . . . . . . . . . . . . . . . . . . . . . 34 11. References . . . . . . . . . . . . . . . . . . . . . . . . . 35
11.1 Normative References . . . . . . . . . . . . . . . . . . . . 34 11.1 Normative References . . . . . . . . . . . . . . . . . . . . 35
11.2 Non-normative References . . . . . . . . . . . . . . . . . . 35 11.2 Non-normative References . . . . . . . . . . . . . . . . . . 36
Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . 38 Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . 38
A. Design Alternatives . . . . . . . . . . . . . . . . . . . . . 38 A. Design Alternatives . . . . . . . . . . . . . . . . . . . . . 39
A.1 New Scheme(s) . . . . . . . . . . . . . . . . . . . . . . 38 A.1 New Scheme(s) . . . . . . . . . . . . . . . . . . . . . . 39
A.2 Other Character Encodings than UTF-8 . . . . . . . . . . . 39 A.2 Other Character Encodings than UTF-8 . . . . . . . . . . . 39
A.3 New Encoding Convention . . . . . . . . . . . . . . . . . 39 A.3 New Encoding Convention . . . . . . . . . . . . . . . . . 39
A.4 Indicating Character Encodings in the URI/IRI . . . . . . 39 A.4 Indicating Character Encodings in the URI/IRI . . . . . . 40
Intellectual Property and Copyright Statements . . . . . . . . 40 Intellectual Property and Copyright Statements . . . . . . . . 41
1. Introduction 1. Introduction
1.1 Overview and Motivation 1.1 Overview and Motivation
A Uniform Resource Identifier (URI) is defined in [RFCYYYY] as a A Uniform Resource Identifier (URI) is defined in [RFCYYYY] as a
sequence of characters chosen from a limited subset of the repertoire sequence of characters chosen from a limited subset of the repertoire
of US-ASCII [ASCII] characters. of US-ASCII [ASCII] characters.
The characters in URIs are frequently used for representing words of The characters in URIs are frequently used for representing words of
skipping to change at page 9, line 19 skipping to change at page 9, line 19
irelative-part = "//" iauthority ipath-abempty irelative-part = "//" iauthority ipath-abempty
/ ipath-abs / ipath-abs
/ ipath-noscheme / ipath-noscheme
/ ipath-empty / ipath-empty
iauthority = [ iuserinfo "@" ] ihost [ ":" port ] iauthority = [ iuserinfo "@" ] ihost [ ":" port ]
iuserinfo = *( iunreserved / pct-encoded / sub-delims / ":" ) iuserinfo = *( iunreserved / pct-encoded / sub-delims / ":" )
ihost = IP-literal / IPv4address / ireg-name ihost = IP-literal / IPv4address / ireg-name
ireg-name = 0*255( iunreserved / pct-encoded / sub-delims ) ireg-name = *( iunreserved / pct-encoded / sub-delims )
ipath = ipath-abempty ; begins with "/" or is empty ipath = ipath-abempty ; begins with "/" or is empty
/ ipath-abs ; begins with "/" but not "//" / ipath-abs ; begins with "/" but not "//"
/ ipath-noscheme ; begins with a non-colon segment / ipath-noscheme ; begins with a non-colon segment
/ ipath-rootless ; begins with a segment / ipath-rootless ; begins with a segment
/ ipath-empty ; zero characters / ipath-empty ; zero characters
ipath-abempty = *( "/" isegment ) ipath-abempty = *( "/" isegment )
ipath-abs = "/" [ isegment-nz *( "/" isegment ) ] ipath-abs = "/" [ isegment-nz *( "/" isegment ) ]
ipath-noscheme = isegment-nzc *( "/" isegment ) ipath-noscheme = isegment-nzc *( "/" isegment )
skipping to change at page 13, line 45 skipping to change at page 13, line 45
case of using an HTTP proxy. case of using an HTTP proxy.
Note: Internationalized Domain Names may be contained in parts of an Note: Internationalized Domain Names may be contained in parts of an
IRI other than the ireg-name part. It is the responsibility of IRI other than the ireg-name part. It is the responsibility of
scheme-specific implementations (if the Internationalized Domain scheme-specific implementations (if the Internationalized Domain
Name is part of the scheme syntax) or of server-side Name is part of the scheme syntax) or of server-side
implementations (if the Internationalized Domain Name is part of implementations (if the Internationalized Domain Name is part of
'iquery') to apply the necessary conversions at the appropriate 'iquery') to apply the necessary conversions at the appropriate
point. Example: Trying to validate the Web page at point. Example: Trying to validate the Web page at
http://résumé.example.org would lead to an IRI of http://résumé.example.org would lead to an IRI of
http://validator.w3.org/check?uri=http%3A%2F%2Frésumé.example.org, http://validator.w3.org/check?uri=http%3A%2F%2Frésumé.
which would convert to a URI of example.org, which would convert to a URI of
http://validator.w3.org/check?uri=http%3A%2F%2Fr%C3%A9sum%C3%A9.example.org. http://validator.w3.org/check?uri=http%3A%2F%2Fr%C3%A9sum%C3%A9.
The server side implementation would be responsible to do the example.org. The server side implementation would be responsible
necessary conversions in order to be able to retrieve the Web to do the necessary conversions in order to be able to retrieve
page. the Web page.
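
The conversion in the example above can be sketched in Python. This is a minimal illustration, not the draft's full mapping procedure: the function name iri_to_uri is hypothetical, the input is assumed to be already normalized, and the special handling of the ireg-name part is deliberately left out. It only percent-encodes each non-ASCII character as its UTF-8 octets.

```python
def iri_to_uri(iri: str) -> str:
    """Percent-encode every non-ASCII character of an IRI as UTF-8 octets.

    Hypothetical helper for illustration only. It assumes the input is
    already normalized and does not apply any ireg-name specific
    conversion.
    """
    out = []
    for ch in iri:
        if ord(ch) < 0x80:
            # US-ASCII characters, including existing %XX escapes,
            # are copied through unchanged.
            out.append(ch)
        else:
            # One %XX triplet per UTF-8 octet of the character.
            out.extend(f"%{octet:02X}" for octet in ch.encode("utf-8"))
    return "".join(out)


iri = "http://validator.w3.org/check?uri=http%3A%2F%2Fr\u00e9sum\u00e9.example.org"
print(iri_to_uri(iri))
```

Because the existing escapes %3A and %2F consist of US-ASCII characters only, they pass through untouched, which matches the URI given in the example.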
Infrastructure accepting IRIs MAY also deal with the printable Infrastructure accepting IRIs MAY also deal with the printable
characters in US-ASCII that are not allowed in URIs, namely "<", ">", characters in US-ASCII that are not allowed in URIs, namely "<", ">",
'"', Space, "{", "}", "|", "\", "^", and "`", in Step 2 above. If '"', Space, "{", "}", "|", "\", "^", and "`", in Step 2 above. If
such characters are found but are not converted, then the conversion such characters are found but are not converted, then the conversion
SHOULD fail. Please note that the number sign ("#"), the percent SHOULD fail. Please note that the number sign ("#"), the percent
sign ("%"), and the square bracket characters ("[", "]") are not part sign ("%"), and the square bracket characters ("[", "]") are not part
of the above list, and MUST NOT be converted. Protocols and formats of the above list, and MUST NOT be converted. Protocols and formats
that have used earlier definitions of IRIs including these characters that have used earlier definitions of IRIs including these characters
MAY require percent-encoding of these characters as a preprocessing MAY require percent-encoding of these characters as a preprocessing
skipping to change at page 26, line 39 skipping to change at page 26, line 39
This section discusses details and gives examples for point c) in This section discusses details and gives examples for point c) in
Section 1.2. In order to be able to use IRIs, the URI corresponding Section 1.2. In order to be able to use IRIs, the URI corresponding
to the IRI in question has to encode original characters into octets to the IRI in question has to encode original characters into octets
using UTF-8. This can be specified for all URIs of a URI scheme, or using UTF-8. This can be specified for all URIs of a URI scheme, or
can apply to individual URIs for schemes that do not specify how to can apply to individual URIs for schemes that do not specify how to
encode original characters. It can apply to the whole URI, or only encode original characters. It can apply to the whole URI, or only
some part. For background information on encoding characters into some part. For background information on encoding characters into
URIs, see also Section 2.5 of [RFCYYYY]. URIs, see also Section 2.5 of [RFCYYYY].
For new URI schemes, using UTF-8 is recommended in [RFC2718]. For new URI schemes, using UTF-8 is recommended in [RFC2718].
Examples where this is already used are the URN syntax [RFC2141], Examples where UTF-8 is already used are the URN syntax [RFC2141],
IMAP URLs [RFC2192], and POP URLs [RFC2384]. On the other hand, IMAP URLs [RFC2192], and POP URLs [RFC2384]. On the other hand,
because the HTTP URL scheme does not specify how to encode original because the HTTP URL scheme does not specify how to encode original
characters, only some HTTP URLs can have corresponding but different characters, only some HTTP URLs can have corresponding but different
IRIs. IRIs.
For example, for a document with a URI of For example, for a document with a URI of
http://www.example.org/r%C3%A9sum%C3%A9.html, it is possible to http://www.example.org/r%C3%A9sum%C3%A9.html, it is possible to
construct a corresponding IRI (in XML notation, see Section 1.4): construct a corresponding IRI (in XML notation, see Section 1.4):
http://www.example.org/r&#xE9;sum&#xE9;.html (&#xE9; stands for the http://www.example.org/r&#xE9;sum&#xE9;.html (&#xE9; stands for the
e-acute character, and %C3%A9 is the UTF-8 encoded and e-acute character, and %C3%A9 is the UTF-8 encoded and
percent-encoded representation of that character). On the other percent-encoded representation of that character). On the other
hand, for a document with a URI of hand, for a document with a URI of
http://www.example.org/r%E9sum%E9.html, the percent-encoding octets http://www.example.org/r%E9sum%E9.html, the percent-encoding octets
cannot be converted to actual characters in an IRI, because the cannot be converted to actual characters in an IRI, because the
percent-encoding is not based on UTF-8. percent-encoding is not based on UTF-8.
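
The difference between the two URIs above can be checked mechanically. The following Python sketch is an illustration under stated assumptions, not the draft's complete URI-to-IRI procedure: the name uri_to_iri is hypothetical, each run of percent-encoded non-ASCII octets is treated as a unit, and the additional restrictions on which characters may appear literally in an IRI are not applied. A run is replaced by characters only if it is strictly valid UTF-8; otherwise it is left percent-encoded.

```python
import re

# A run of percent-encoded octets whose value is 0x80 or above.
HIGH_OCTET_RUN = re.compile(r"(?:%[89A-Fa-f][0-9A-Fa-f])+")


def uri_to_iri(uri: str) -> str:
    """Replace UTF-8 based percent-encoding with actual characters.

    Hypothetical helper for illustration only. Runs that are not valid
    UTF-8 are kept as they are, as are percent-encoded US-ASCII octets.
    """
    def decode_run(match: re.Match) -> str:
        run = match.group(0)
        octets = bytes.fromhex(run.replace("%", ""))
        try:
            # Strict decoding rejects invalid and overlong sequences.
            return octets.decode("utf-8")
        except UnicodeDecodeError:
            return run  # not based on UTF-8: leave percent-encoded

    return HIGH_OCTET_RUN.sub(decode_run, uri)


print(uri_to_iri("http://www.example.org/r%C3%A9sum%C3%A9.html"))
print(uri_to_iri("http://www.example.org/r%E9sum%E9.html"))
```

The first URI yields an IRI containing the e-acute character directly, while the second comes back unchanged because the single octet E9 is not a complete UTF-8 sequence.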
This means that for most URI schemes, there is no need to upgrade
their scheme definition in order for them to work with IRIs. The
main case where upgrading a scheme definition may make sense is when
a scheme definition is limited to US-ASCII characters, with either no
provision or only highly scheme-specific provisions for non-ASCII
characters/octets, although there is a desire to include such
characters. An example of such a scheme might be the mailto: scheme
[RFC2368].
This specification does not upgrade any scheme specifications in any
way; this has to be done separately. Also, it should be noted that
there is no such thing as an "IRI scheme"; all IRIs use URI schemes,
and all URI schemes can be used with IRIs, even though in some cases
only by using URIs directly as IRIs, without any conversion.
The requirement for the use of UTF-8 applies to all parts of a URI The requirement for the use of UTF-8 applies to all parts of a URI
(with the potential exception of the ireg-name part, see Section (with the potential exception of the ireg-name part, see Section
3.1). However, it is possible that the capability of IRIs to 3.1). However, it is possible that the capability of IRIs to
represent a wide range of characters directly is used just in some represent a wide range of characters directly is used just in some
parts of the IRI (or IRI reference). The other parts of the IRI may parts of the IRI (or IRI reference). The other parts of the IRI may
only contain US-ASCII characters, or they may not be based on UTF-8. only contain US-ASCII characters, or they may not be based on UTF-8.
They may be based on another character encoding, or they may directly They may be based on another character encoding, or they may directly
encode raw binary data (see also [RFC2397]). encode raw binary data (see also [RFC2397]).
For example, it is possible to have a URI reference of For example, it is possible to have a URI reference of
skipping to change at page 32, line 38 skipping to change at page 33, line 4
example, when setting up a new file-based Web server, using UTF-8 as example, when setting up a new file-based Web server, using UTF-8 as
the character encoding for file names will make the transition to the character encoding for file names will make the transition to
IRIs easier. Likewise, when setting up a new Web form using UTF-8 as IRIs easier. Likewise, when setting up a new Web form using UTF-8 as
the character encoding of the form page, the returned query URIs will the character encoding of the form page, the returned query URIs will
use UTF-8 as the character encoding (unless the user, for whatever use UTF-8 as the character encoding (unless the user, for whatever
reason, changes the character encoding) and will therefore be reason, changes the character encoding) and will therefore be
compatible with IRIs. compatible with IRIs.
These recommendations, when taken together, will allow for the These recommendations, when taken together, will allow for the
extension from URIs to IRIs in order to handle characters other than extension from URIs to IRIs in order to handle characters other than
US-ASCII while minimizing interoperability problems. US-ASCII while minimizing interoperability problems. For
considerations regarding the upgrade of URI scheme definitions,
please see Section 6.4.
8. Security Considerations 8. Security Considerations
The security considerations discussed in [RFCYYYY] also apply to The security considerations discussed in [RFCYYYY] also apply to
IRIs. In addition, the following issues require particular care for IRIs. In addition, the following issues require particular care for
IRIs. IRIs.
Incorrect encoding or decoding can lead to security problems. In Incorrect encoding or decoding can lead to security problems. In
particular, some UTF-8 decoders do not check against overlong byte particular, some UTF-8 decoders do not check against overlong byte
sequences. As an example, a '/' is encoded with the byte 0x2F both sequences. As an example, a '/' is encoded with the byte 0x2F both
skipping to change at page 36, line 21 skipping to change at page 36, line 40
[Duerst01] [Duerst01]
Duerst, M., "Internationalized Resource Identifiers: From Duerst, M., "Internationalized Resource Identifiers: From
Specification to Testing", Proc. 19th International Specification to Testing", Proc. 19th International
Unicode Conference, San Jose, September 2001, Unicode Conference, San Jose, September 2001,
<http://www.w3.org/2001/Talks/0912-IUC-IRI/paper.html>. <http://www.w3.org/2001/Talks/0912-IUC-IRI/paper.html>.
[Duerst97] [Duerst97]
Duerst, M., "The Properties and Promises of UTF-8", Proc. Duerst, M., "The Properties and Promises of UTF-8", Proc.
11th International Unicode Conference, San Jose, 11th International Unicode Conference, San Jose,
September 1997, September 1997,
<http://www.ifi.unizh.ch/mml/mduerst/papers/PDF/IUC11-UTF-8.pdf> <http://www.ifi.unizh.ch/mml/mduerst/papers/PDF/
. IUC11-UTF-8.pdf>.
[Gettys] Gettys, J., "URI Model Consequences", [Gettys] Gettys, J., "URI Model Consequences",
<http://www.w3.org/DesignIssues/ModelConsequences>. <http://www.w3.org/DesignIssues/ModelConsequences>.
[HTML4] Raggett, D., Le Hors, A. and I. Jacobs, "HTML 4.01 [HTML4] Raggett, D., Le Hors, A. and I. Jacobs, "HTML 4.01
Specification", World Wide Web Consortium Recommendation, Specification", World Wide Web Consortium Recommendation,
December 1999, December 1999,
<http://www.w3.org/TR/REC-html40/appendix/notes.html#h-B.2> <http://www.w3.org/TR/REC-html40/appendix/
. notes.html#h-B.2>.
[RFC2130] Weider, C., Preston, C., Simonsen, K., Alvestrand, H., [RFC2130] Weider, C., Preston, C., Simonsen, K., Alvestrand, H.,
Atkinson, R., Crispin, M. and P. Svanberg, "The Report of Atkinson, R., Crispin, M. and P. Svanberg, "The Report of
the IAB Character Set Workshop held 29 February - 1 March, the IAB Character Set Workshop held 29 February - 1 March,
1996", RFC 2130, April 1997. 1996", RFC 2130, April 1997.
[RFC2141] Moats, R., "URN Syntax", RFC 2141, May 1997. [RFC2141] Moats, R., "URN Syntax", RFC 2141, May 1997.
[RFC2192] Newman, C., "IMAP URL Scheme", RFC 2192, September 1997. [RFC2192] Newman, C., "IMAP URL Scheme", RFC 2192, September 1997.
[RFC2277] Alvestrand, H., "IETF Policy on Character Sets and [RFC2277] Alvestrand, H., "IETF Policy on Character Sets and
Languages", BCP 18, RFC 2277, January 1998. Languages", BCP 18, RFC 2277, January 1998.
[RFC2368] Hoffman, P., Masinter, L. and J. Zawinski, "The mailto URL
scheme", RFC 2368, July 1998.
[RFC2384] Gellens, R., "POP URL Scheme", RFC 2384, August 1998. [RFC2384] Gellens, R., "POP URL Scheme", RFC 2384, August 1998.
[RFC2396] Berners-Lee, T., Fielding, R. and L. Masinter, "Uniform [RFC2396] Berners-Lee, T., Fielding, R. and L. Masinter, "Uniform
Resource Identifiers (URI): Generic Syntax", RFC 2396, Resource Identifiers (URI): Generic Syntax", RFC 2396,
August 1998. August 1998.
[RFC2397] Masinter, L., "The "data" URL scheme", RFC 2397, August [RFC2397] Masinter, L., "The "data" URL scheme", RFC 2397, August
1998. 1998.
[RFC2616] Fielding, R., Gettys, J., Mogul, J., Nielsen, H., [RFC2616] Fielding, R., Gettys, J., Mogul, J., Nielsen, H.,
 End of changes. 
