| draft-duerst-iri-02.txt | draft-duerst-iri-03.txt | |||
|---|---|---|---|---|
| | ||||
| Network Working Group M. Duerst | Network Working Group M. Duerst | |||
| Internet-Draft W3C | Internet-Draft W3C | |||
| Expires: May 4, 2003 M. Suignard | Expires: August 31, 2003 M. Suignard | |||
| Microsoft Corporation | Microsoft Corporation | |||
| November 3, 2002 | March 2, 2003 | |||
| Internationalized Resource Identifiers (IRIs) | Internationalized Resource Identifiers (IRIs) | |||
| draft-duerst-iri-02 | draft-duerst-iri-03 | |||
| Status of this Memo | Status of this Memo | |||
| This document is an Internet-Draft and is in full conformance with | This document is an Internet-Draft and is in full conformance with | |||
| all provisions of Section 10 of RFC2026. | all provisions of Section 10 of RFC2026. | |||
| Internet-Drafts are working documents of the Internet Engineering | Internet-Drafts are working documents of the Internet Engineering | |||
| Task Force (IETF), its areas, and its working groups. Note that | Task Force (IETF), its areas, and its working groups. Note that | |||
| other groups may also distribute working documents as Internet- | other groups may also distribute working documents as Internet- | |||
| Drafts. | Drafts. | |||
| skipping to change at page 1, line 34 | skipping to change at page 1, line 33 | |||
| and may be updated, replaced, or obsoleted by other documents at any | and may be updated, replaced, or obsoleted by other documents at any | |||
| time. It is inappropriate to use Internet-Drafts as reference | time. It is inappropriate to use Internet-Drafts as reference | |||
| material or to cite them other than as "work in progress." | material or to cite them other than as "work in progress." | |||
| The list of current Internet-Drafts can be accessed at http:// | The list of current Internet-Drafts can be accessed at http:// | |||
| www.ietf.org/ietf/1id-abstracts.txt. | www.ietf.org/ietf/1id-abstracts.txt. | |||
| The list of Internet-Draft Shadow Directories can be accessed at | The list of Internet-Draft Shadow Directories can be accessed at | |||
| http://www.ietf.org/shadow.html. | http://www.ietf.org/shadow.html. | |||
| This Internet-Draft will expire on May 4, 2003. | This Internet-Draft will expire on August 31, 2003. | |||
| Copyright Notice | Copyright Notice | |||
| Copyright (C) The Internet Society (2002). All Rights Reserved. | Copyright (C) The Internet Society (2003). All Rights Reserved. | |||
| Abstract | Abstract | |||
| This document defines a new protocol element, the Internationalized | This document defines a new protocol element, the Internationalized | |||
| Resource Identifier (IRI), as a complement to the URI [RFC2396]. An | Resource Identifier (IRI), as a complement to the URI [RFC2396]. An | |||
| IRI is a sequence of characters from the Universal Character Set | IRI is a sequence of characters from the Universal Character Set | |||
| [ISO10646]. A mapping from IRIs to URIs is defined, which means that | [ISO10646]. A mapping from IRIs to URIs is defined, which means that | |||
| IRIs can be used instead of URIs where appropriate to identify | IRIs can be used instead of URIs where appropriate to identify | |||
| resources. | resources. | |||
| skipping to change at page 2, line 16 | skipping to change at page 2, line 16 | |||
| formats, and software components that now deal with URIs are | formats, and software components that now deal with URIs are | |||
| provided. | provided. | |||
| NOTE | NOTE | |||
| This document is a product of the Internationalization Working Group | This document is a product of the Internationalization Working Group | |||
| (I18N WG) of the World Wide Web Consortium (W3C). For general | (I18N WG) of the World Wide Web Consortium (W3C). For general | |||
| discussion, please use the www-international@w3.org mailing list | discussion, please use the www-international@w3.org mailing list | |||
| (publicly archived at http://lists.w3.org/Archives/Public/www- | (publicly archived at http://lists.w3.org/Archives/Public/www- | |||
| international/). For more information on the topic of this document, | international/). For more information on the topic of this document, | |||
| please also see [W3CIRI] and [Duer01]. | please also see [W3CIRI] and [Duerst01]. | |||
| Table of Contents | Table of Contents | |||
| 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 4 | 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 4 | |||
| 1.1 Overview and Motivation . . . . . . . . . . . . . . . . . . . 4 | 1.1 Overview and Motivation . . . . . . . . . . . . . . . . . . 4 | |||
| 1.2 Applicability . . . . . . . . . . . . . . . . . . . . . . . . 4 | 1.2 Applicability . . . . . . . . . . . . . . . . . . . . . . . 4 | |||
| 1.3 Definitions . . . . . . . . . . . . . . . . . . . . . . . . . 5 | 1.3 Definitions . . . . . . . . . . . . . . . . . . . . . . . . 5 | |||
| 1.4 Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 | 1.4 Notation . . . . . . . . . . . . . . . . . . . . . . . . . . 6 | |||
| 2. IRI Syntax . . . . . . . . . . . . . . . . . . . . . . . . . . 6 | 2. IRI Syntax . . . . . . . . . . . . . . . . . . . . . . . . . 7 | |||
| 2.1 Summary of IRI Syntax . . . . . . . . . . . . . . . . . . . . 7 | 2.1 Summary of IRI Syntax . . . . . . . . . . . . . . . . . . . 7 | |||
| 2.2 ABNF for IRI References and IRIs . . . . . . . . . . . . . . . 7 | 2.2 ABNF for IRI References and IRIs . . . . . . . . . . . . . . 7 | |||
| 2.3 IRI Equivalence and Normalization . . . . . . . . . . . . . . 10 | 2.3 IRI Equivalence and Normalization . . . . . . . . . . . . . 10 | |||
| 3. Relationship between IRIs and URIs . . . . . . . . . . . . . . 12 | 3. Relationship between IRIs and URIs . . . . . . . . . . . . . 11 | |||
| 3.1 Mapping of IRIs to URIs . . . . . . . . . . . . . . . . . . . 12 | 3.1 Mapping of IRIs to URIs . . . . . . . . . . . . . . . . . . 12 | |||
| 3.2 Converting URIs to IRIs . . . . . . . . . . . . . . . . . . . 14 | 3.2 Converting URIs to IRIs . . . . . . . . . . . . . . . . . . 14 | |||
| 4. Bidirectional IRIs for Right-to-left Languages . . . . . . . . 15 | 3.2.1 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . 15 | |||
| 4.1 Logical Storage and Visual Presentation . . . . . . . . . . . 15 | 4. Bidirectional IRIs for Right-to-left Languages . . . . . . . 16 | |||
| 4.2 Bidi IRI Structure . . . . . . . . . . . . . . . . . . . . . . 16 | 4.1 Logical Storage and Visual Presentation . . . . . . . . . . 17 | |||
| 4.3 Input of Bidi IRIs . . . . . . . . . . . . . . . . . . . . . . 17 | 4.2 Bidi IRI Structure . . . . . . . . . . . . . . . . . . . . . 17 | |||
| 4.4 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 | 4.3 Input of Bidi IRIs . . . . . . . . . . . . . . . . . . . . . 18 | |||
| 5. Use of IRIs . . . . . . . . . . . . . . . . . . . . . . . . . 19 | 4.4 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . 18 | |||
| 5.1 Limitations on UCS Characters Allowed in IRIs . . . . . . . . 19 | 5. Use of IRIs . . . . . . . . . . . . . . . . . . . . . . . . 20 | |||
| 5.2 Software Interfaces and Protocols . . . . . . . . . . . . . . 20 | 5.1 Limitations on UCS Characters Allowed in IRIs . . . . . . . 20 | |||
| 5.3 Format of URIs and IRIs in Documents and Protocols . . . . . . 20 | 5.2 Software Interfaces and Protocols . . . . . . . . . . . . . 21 | |||
| 5.4 Relative IRI References . . . . . . . . . . . . . . . . . . . 21 | 5.3 Format of URIs and IRIs in Documents and Protocols . . . . . 21 | |||
| 6. URI/IRI Processing Guidelines (informative) . . . . . . . . . 21 | 5.4 Relative IRI References . . . . . . . . . . . . . . . . . . 22 | |||
| 6.1 URI/IRI Software Interfaces . . . . . . . . . . . . . . . . . 21 | 6. URI/IRI Processing Guidelines (informative) . . . . . . . . 22 | |||
| 6.2 URI/IRI Entry . . . . . . . . . . . . . . . . . . . . . . . . 21 | 6.1 URI/IRI Software Interfaces . . . . . . . . . . . . . . . . 22 | |||
| 6.3 URI/IRI Transfer Between Applications . . . . . . . . . . . . 22 | 6.2 URI/IRI Entry . . . . . . . . . . . . . . . . . . . . . . . 23 | |||
| 6.4 URI/IRI Generation . . . . . . . . . . . . . . . . . . . . . . 23 | 6.3 URI/IRI Transfer Between Applications . . . . . . . . . . . 23 | |||
| 6.5 URI/IRI Selection . . . . . . . . . . . . . . . . . . . . . . 23 | 6.4 URI/IRI Generation . . . . . . . . . . . . . . . . . . . . . 24 | |||
| 6.6 Display of URIs/IRIs . . . . . . . . . . . . . . . . . . . . . 24 | 6.5 URI/IRI Selection . . . . . . . . . . . . . . . . . . . . . 24 | |||
| 6.7 Interpretation of URIs and IRIs . . . . . . . . . . . . . . . 24 | 6.6 Display of URIs/IRIs . . . . . . . . . . . . . . . . . . . . 25 | |||
| 6.8 Upgrading Strategy . . . . . . . . . . . . . . . . . . . . . . 25 | 6.7 Interpretation of URIs and IRIs . . . . . . . . . . . . . . 25 | |||
| 7. Security Considerations . . . . . . . . . . . . . . . . . . . 26 | 6.8 Upgrading Strategy . . . . . . . . . . . . . . . . . . . . . 26 | |||
| 8. Change log . . . . . . . . . . . . . . . . . . . . . . . . . . 27 | 7. Security Considerations . . . . . . . . . . . . . . . . . . 27 | |||
| 8.1 Changes from -01 to -02 . . . . . . . . . . . . . . . . . . . 27 | 8. Issues List . . . . . . . . . . . . . . . . . . . . . . . . 28 | |||
| 8.2 Changes from -00 to -01 . . . . . . . . . . . . . . . . . . . 27 | 9. Change log . . . . . . . . . . . . . . . . . . . . . . . . . 28 | |||
| 9. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . 27 | 9.1 Changes from -02 to -03 . . . . . . . . . . . . . . . . . . 28 | |||
| Normative References . . . . . . . . . . . . . . . . . . . . . 28 | 9.2 Changes from -01 to -02 . . . . . . . . . . . . . . . . . . 29 | |||
| Non-normative References . . . . . . . . . . . . . . . . . . . 29 | 9.3 Changes from -00 to -01 . . . . . . . . . . . . . . . . . . 29 | |||
| Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . 31 | 10. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . 29 | |||
| Full Copyright Statement . . . . . . . . . . . . . . . . . . . 32 | Normative References . . . . . . . . . . . . . . . . . . . . 30 | |||
| Non-normative References . . . . . . . . . . . . . . . . . . 31 | ||||
| Authors' Addresses . . . . . . . . . . . . . . . . . . . . . 33 | ||||
| Full Copyright Statement . . . . . . . . . . . . . . . . . . 34 | ||||
| 1. Introduction | 1. Introduction | |||
| 1.1 Overview and Motivation | 1.1 Overview and Motivation | |||
| A URI is defined in [RFC2396] as a sequence of characters chosen from | A URI is defined in [RFC2396] as a sequence of characters chosen from | |||
| a limited subset of the repertoire of US-ASCII characters. | a limited subset of the repertoire of US-ASCII characters. | |||
| The characters in URIs are frequently used for representing words of | The characters in URIs are frequently used for representing words of | |||
| natural languages. Such usage has many advantages: such URIs are | natural languages. Such usage has many advantages: such URIs are | |||
| skipping to change at page 5, line 26 | skipping to change at page 5, line 26 | |||
| UTF-8. For new URI schemes, this is recommended in [RFC2718]. | UTF-8. For new URI schemes, this is recommended in [RFC2718]. | |||
| This allows IRIs to be used with the URN syntax [RFC2141] as | This allows IRIs to be used with the URN syntax [RFC2141] as | |||
| well as recent URL scheme definitions based on UTF-8, such as | well as recent URL scheme definitions based on UTF-8, such as | |||
| IMAP URLs [RFC2192] and POP URLs [RFC2384]. | IMAP URLs [RFC2192] and POP URLs [RFC2384]. | |||
| In cases and for pieces where an encoding other than UTF-8 is used, | In cases and for pieces where an encoding other than UTF-8 is used, | |||
| and for raw binary data encoded in URIs (see [RFC2397]), the octets | and for raw binary data encoded in URIs (see [RFC2397]), the octets | |||
| have to be %-escaped. In these situations, the ability of IRIs to | have to be %-escaped. In these situations, the ability of IRIs to | |||
| directly represent a wide character repertoire cannot be used. | directly represent a wide character repertoire cannot be used. | |||
| For example, for a document with a URI of http://www.example.org/ | For example, for a document with a URI of | |||
| r%C3%A9sum%C3%A9.html, it is possible to construct a corresponding | http://www.example.org/r%C3%A9sum%C3%A9.html, it is possible to | |||
| IRI (in XML notation): http://www.example.org/r&#xE9;sum&#xE9;.html | construct a corresponding IRI (in XML notation, see Section 1.4): | |||
| (&#xE9; stands for the e-acute character, and %C3%A9 is the UTF-8 encoded | http://www.example.org/r&#xE9;sum&#xE9;.html (&#xE9; stands for the | |||
| and escaped representation of that character). On the other hand, | e-acute character, and %C3%A9 is the UTF-8 encoded and escaped | |||
| for a document with a URI of http://www.example.org/r%e9sum%e9.html, | representation of that character). On the other hand, for a document | |||
| the escaped octets cannot be converted to actual characters in an | with a URI of http://www.example.org/r%E9sum%E9.html, the escaped | |||
| IRI, because the escaping is based on iso-8859-1 rather than UTF-8. | octets cannot be converted to actual characters in an IRI, because | |||
| the escaping is based on iso-8859-1 rather than UTF-8. | ||||
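The distinction drawn in this example can be checked mechanically. The following is a minimal Python sketch, not part of either draft: the helper name `escapes_to_iri` is hypothetical, and it only looks at runs of escaped octets at or above 0x80. It does not attempt the full URI-to-IRI conversion of Section 3.2 and does not filter characters that are disallowed in IRIs.

```python
import re

# Runs of %-escapes whose octets are all >= 0x80. UTF-8 sequences for
# non-ASCII characters consist only of such octets, so escapes of ASCII
# characters (including reserved ones) are never touched.
_HIGH_ESCAPES = re.compile(r"(?:%[89A-Fa-f][0-9A-Fa-f])+")


def escapes_to_iri(uri):
    """Hypothetical helper: unescape a run only if it is valid UTF-8."""

    def replace(match):
        run = match.group(0)
        octets = bytes.fromhex(run.replace("%", ""))
        try:
            return octets.decode("utf-8")
        except UnicodeDecodeError:
            # Not UTF-8 (for example iso-8859-1): keep the escapes as they are.
            return run

    return _HIGH_ESCAPES.sub(replace, uri)
```

Under these assumptions the UTF-8 based URI above yields an IRI with two e-acute characters, while the iso-8859-1 based URI is returned unchanged.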
| 1.3 Definitions | 1.3 Definitions | |||
| The following definitions are used in this document; they follow the | The following definitions are used in this document; they follow the | |||
| terms in [RFC2130], [RFC2277] and [ISO10646]: | terms in [RFC2130], [RFC2277] and [ISO10646]: | |||
| character: A member of a set of elements used for the | character: A member of a set of elements used for the | |||
| organization, control, or representation of data. For example, | organization, control, or representation of data. For example, | |||
| "LATIN CAPITAL LETTER A" names a character. | "LATIN CAPITAL LETTER A" names a character. | |||
| skipping to change at page 6, line 26 | skipping to change at page 6, line 26 | |||
| character encoding. | character encoding. | |||
| UCS: Universal Character Set; the coded character set defined by | UCS: Universal Character Set; the coded character set defined by | |||
| [ISO10646] and [UNIV3]. | [ISO10646] and [UNIV3]. | |||
| IRI reference: The term "IRI reference" denotes the common usage | IRI reference: The term "IRI reference" denotes the common usage | |||
| of an internationalized resource identifier. An IRI reference | of an internationalized resource identifier. An IRI reference | |||
| may be absolute or relative, and may have additional | may be absolute or relative, and may have additional | |||
| information attached in the form of a fragment identifier. | information attached in the form of a fragment identifier. | |||
| However, the "IRI" that results from such a reference only | However, the "IRI" that results from such a reference only | |||
| includes the absolute IRI after fragment identifier (if any) is | includes the absolute IRI after the fragment identifier (if | |||
| removed and after any relative IRI is resolved to its absolute | any) is removed and after any relative IRI is resolved to its | |||
| form. | absolute form. | |||
| 1.4 Notation | 1.4 Notation | |||
| RFCs and Internet Drafts currently do not allow any characters | ||||
| outside the US-ASCII repertoire. Therefore, this document uses | ||||
| various special notations to denote such characters. | ||||
| In text, characters outside US-ASCII are sometimes referenced by | In text, characters outside US-ASCII are sometimes referenced by | |||
| using a prefix of 'U+', followed by four to six hexadecimal digits. | using a prefix of 'U+', followed by four to six hexadecimal digits. | |||
| To represent characters outside US-ASCII in examples, this document | To represent characters outside US-ASCII in examples, this document | |||
| uses two notations called 'XML Notation' and 'Bidi Notation'. | uses two notations called 'XML Notation' and 'Bidi Notation'. | |||
| XML Notation uses leading '&#x', trailing ';', and the hexadecimal | XML Notation uses leading '&#x', trailing ';', and the hexadecimal | |||
| number of the character in the UCS in between. Example: &#x42F; stands | number of the character in the UCS in between. Example: &#x44F; | |||
| for CYRILLIC CAPITAL LETTER YA. In this notation, an actual '&' is | stands for CYRILLIC CAPITAL LETTER YA. In this notation, an actual | |||
| denoted by '&amp;'. | '&' is denoted by '&amp;'. | |||
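For readers who want to turn examples written in XML Notation into actual character strings, the following Python sketch may help. It is an illustration only; the function name `from_xml_notation` is hypothetical and handles just the two constructs described above (hexadecimal character references and the escaped ampersand).

```python
import re

# Either a hexadecimal character reference or an escaped ampersand.
_XML_NOTATION = re.compile(r"&#x([0-9A-Fa-f]+);|&amp;")


def from_xml_notation(text):
    """Hypothetical helper: expand XML Notation into UCS characters."""

    def replace(match):
        if match.group(1) is None:
            return "&"  # the '&amp;' alternative matched
        # chr() raises ValueError for numbers beyond U+10FFFF.
        return chr(int(match.group(1), 16))

    return _XML_NOTATION.sub(replace, text)
```

The substitution is done in a single pass, so an escaped ampersand followed by text that looks like a character reference is not expanded twice.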
| Bidi Notation is used for bidirectional examples: lower case ASCII | Bidi Notation is used for bidirectional examples: lower case ASCII | |||
| letters stand for Latin letters or other letters that are written | letters stand for Latin letters or other letters that are written | |||
| left-to-right, whereas upper case letters represent Arabic or Hebrew | left-to-right, whereas upper case letters represent Arabic or Hebrew | |||
| letters that are written right-to-left. | letters that are written right-to-left. | |||
| 2. IRI Syntax | 2. IRI Syntax | |||
| This section defines the syntax of Internationalized Resource | This section defines the syntax of Internationalized Resource | |||
| Identifiers (IRIs). | Identifiers (IRIs). | |||
| skipping to change at page 7, line 50 | skipping to change at page 8, line 5 | |||
| because it is in the 'unreserved' category in URIs. | because it is in the 'unreserved' category in URIs. | |||
| 2.2 ABNF for IRI References and IRIs | 2.2 ABNF for IRI References and IRIs | |||
| While it might be possible to define IRI references and IRIs merely | While it might be possible to define IRI references and IRIs merely | |||
| by their transformation to URI references and URIs, they can also be | by their transformation to URI references and URIs, they can also be | |||
| accepted and processed directly. Therefore, an ABNF definition for | accepted and processed directly. Therefore, an ABNF definition for | |||
| IRI references (which are the most general concept and the start of | IRI references (which are the most general concept and the start of | |||
| the grammar) and IRIs is given here. The syntax of this ABNF is | the grammar) and IRIs is given here. The syntax of this ABNF is | |||
| described in [RFC2234]. Character numbers are taken from the UCS, | described in [RFC2234]. Character numbers are taken from the UCS, | |||
| without implying any actual binary encoding. | without implying any actual binary encoding. Terminals in the ABNF | |||
| are characters, not bytes. | ||||
| The following rules are different from [RFC2396]: | The following rules are different from [RFC2396]: | |||
| absolute-IRI-reference = absolute-IRI [ "#" ifragment ] | absolute-IRI-reference = absolute-IRI [ "#" ifragment ] | |||
| IRI-reference = [ absolute-IRI / relative-IRI ] | IRI-reference = [ absolute-IRI / relative-IRI ] | |||
| [ "#" ifragment ] | [ "#" ifragment ] | |||
| absolute-IRI = scheme ":" ( ihier-part / iopaque-part ) | absolute-IRI = scheme ":" ( ihier-part / iopaque-part ) | |||
| relative-IRI = [ inet-path / iabs-path / irel-path ] | relative-IRI = [ inet-path / iabs-path / irel-path ] | |||
| [ "?" iquery ] | [ "?" iquery ] | |||
| skipping to change at page 8, line 40 | skipping to change at page 8, line 43 | |||
| ireg-name = 1*( iunreserved / escaped / ";" / | ireg-name = 1*( iunreserved / escaped / ";" / | |||
| ":" / "@" / "&" / "=" / "+" / "$" / "," ) | ":" / "@" / "&" / "=" / "+" / "$" / "," ) | |||
| iserver = [ [ iuserinfo "@" ] ihostport ] | iserver = [ [ iuserinfo "@" ] ihostport ] | |||
| iuserinfo = *( iunreserved / escaped / ";" / | iuserinfo = *( iunreserved / escaped / ";" / | |||
| ":" / "&" / "=" / "+" / "$" / "," ) | ":" / "&" / "=" / "+" / "$" / "," ) | |||
| ihostport = ihost [ ":" port ] | ihostport = ihost [ ":" port ] | |||
| ihost = IPv6reference / IPv4address / ihostname | ihost = IPv6reference / IPv4address / ihostname | |||
| ihostname = << as specified by [RFCXXXX] >> | ihostname = idomainlabel [ iqualified] | |||
| iqualified = *( "." idomainlabel ) [ "." itoplabel [ "." ] ] | ||||
| idomainlabel = <<See following production rules>> | ||||
| itoplabel = <<See following production rules>> | ||||
| ipath = [ iabs-path / iopaque-part ] | ipath = [ iabs-path / iopaque-part ] | |||
| ipath-segments = isegment *( "/" isegment ) | ipath-segments = isegment *( "/" isegment ) | |||
| isegment = *ipchar | isegment = *ipchar | |||
| ipchar = iunreserved / escaped / ";" / | ipchar = iunreserved / escaped / ";" / | |||
| ":" / "@" / "&" / "=" / "+" / "$" / "," | ":" / "@" / "&" / "=" / "+" / "$" / "," | |||
| iquery = *( ipchar / "/" / "?" ) | iquery = *( ipchar / iprivate / "/" / "?" ) | |||
| ifragment = *( ipchar / "/" / "?" ) | ifragment = *( ipchar / "/" / "?" ) | |||
| iric = reserved / iunreserved / escaped | iric = reserved / iunreserved / escaped | |||
| iunreserved = ichar / unreserved | iunreserved = unreserved / ucschar / iadditional | |||
| ichar = idelims / ucschar / " " / "{" / "}" / "|" | iadditional = "<" / ">" / DQUOTE / SP / "{" / "}" / | |||
| / "\" / "^" / "`" | "|" / "\" / "^" / "`" | |||
| idelims = "<" / ">" / DQUOTE | ||||
| ucschar = %xA0-D7FF / %xF900-FDCF / %xFDF0-FFEF | ucschar = %xA0-D7FF / %xF900-FDCF / %xFDF0-FFEF | |||
| / %x10000-1FFFD / %x20000-2FFFD / %x30000-3FFFD | / %x10000-1FFFD / %x20000-2FFFD / %x30000-3FFFD | |||
| / %x40000-4FFFD / %x50000-5FFFD / %x60000-6FFFD | / %x40000-4FFFD / %x50000-5FFFD / %x60000-6FFFD | |||
| / %x70000-7FFFD / %x80000-8FFFD / %x90000-9FFFD | / %x70000-7FFFD / %x80000-8FFFD / %x90000-9FFFD | |||
| / %xA0000-AFFFD / %xB0000-BFFFD / %xC0000-CFFFD | / %xA0000-AFFFD / %xB0000-BFFFD / %xC0000-CFFFD | |||
| / %xD0000-DFFFD / %xE1000-EFFFD | / %xD0000-DFFFD / %xE1000-EFFFD | |||
| iprivate = %xE000-F8FF / %xF0000-FFFFD / %x100000-10FFFD | ||||
| The 'idomainlabel' and 'itoplabel' production rules are as follows: | ||||
| The values 'idomainlabel' and 'itoplabel' are defined as a string of | ||||
| 'ucschar' obeying the following rules: | ||||
| a) Given a string of 'ucschar' values, the ToASCII operation | ||||
| [RFCXXXX] is performed on that string with the flag | ||||
| UseSTD3ASCIIRules set to TRUE and the flag AllowUnassigned set | ||||
| to FALSE for creating IRIs and set to TRUE otherwise. | ||||
| b) ToASCII is successful and results in a string conforming to | ||||
| 'domainlabel' for 'idomainlabel' and 'toplabel' for 'itoplabel' | ||||
| (see below for 'domainlabel' and 'toplabel'). | ||||
| Note that the space character and various delimiters are allowed in | Note that the space character and various delimiters are allowed in | |||
| IRIs and IRI references. This is further discussed in Section 5.1. | IRIs and IRI references. This is further discussed in Section 5.1. | |||
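The 'ucschar' and 'iprivate' ranges above translate directly into a code point test. The following Python sketch is illustrative only; the names `UCSCHAR_RANGES`, `IPRIVATE_RANGES`, `is_ucschar` and `is_iprivate` are hypothetical, and the ranges are copied from the -03 column of the ABNF.

```python
# Ranges copied from the 'ucschar' rule: three BMP ranges, planes 1 to 13
# (each ending at xFFFD), and the upper part of plane 14.
UCSCHAR_RANGES = (
    [(0xA0, 0xD7FF), (0xF900, 0xFDCF), (0xFDF0, 0xFFEF)]
    + [(plane << 16, (plane << 16) + 0xFFFD) for plane in range(0x1, 0xE)]
    + [(0xE1000, 0xEFFFD)]
)

# Ranges copied from the 'iprivate' rule (private use areas).
IPRIVATE_RANGES = [(0xE000, 0xF8FF), (0xF0000, 0xFFFFD), (0x100000, 0x10FFFD)]


def _in_ranges(ch, ranges):
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in ranges)


def is_ucschar(ch):
    """True if the single character ch matches 'ucschar'."""
    return _in_ranges(ch, UCSCHAR_RANGES)


def is_iprivate(ch):
    """True if the single character ch matches 'iprivate'."""
    return _in_ranges(ch, IPRIVATE_RANGES)
```

Because terminals in the ABNF are characters rather than bytes, the test operates on code points and is independent of any character encoding.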
| The following are the same as [RFC2396bis]: | The following are the same as [RFC2396bis]: | |||
| scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." ) | scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." ) | |||
| port = *DIGIT | port = *DIGIT | |||
| domainlabel = alphanum [ 0*61( alphanum / "-" ) alphanum ] | ||||
| toplabel = ALPHA [ 0*61( alphanum / "-" ) alphanum ] | ||||
| alphanum = ALPHA / DIGIT | alphanum = ALPHA / DIGIT | |||
| IPv4address = dec-octet 3( "." dec-octet ) | IPv4address = dec-octet 3( "." dec-octet ) | |||
| dec-octet = DIGIT / ; 0-9 | dec-octet = DIGIT / ; 0-9 | |||
| ( %x31-39 DIGIT ) / ; 10-99 | ( %x31-39 DIGIT ) / ; 10-99 | |||
| ( "1" 2*DIGIT ) / ; 100-199 | ( "1" 2*DIGIT ) / ; 100-199 | |||
| ( "2" %x30-34 DIGIT ) / ; 200-249 | ( "2" %x30-34 DIGIT ) / ; 200-249 | |||
| ( "25" %x30-35 ) ; 250-255 | ( "25" %x30-35 ) ; 250-255 | |||
| IPv6reference = "[" IPv6address "]" | IPv6reference = "[" IPv6address "]" | |||
| IPv6address = ( 7( h4 ":" ) h4 ) / | IPv6address = ( 7( h4 ":" ) h4 ) / | |||
| ( "::" 0*6( h4 ":" ) [ h4 ] ) / | ( "::" 0*6( h4 ":" ) [ h4 ] ) / | |||
| ( h4 "::" 0*5( h4 ":" ) [ h4 ] ) / | ( h4 "::" 0*5( h4 ":" ) [ h4 ] ) / | |||
| ( h4 ":" h4 "::" 0*4( h4 ":" ) [ h4 ] ) / | ( h4 ":" h4 "::" 0*4( h4 ":" ) [ h4 ] ) / | |||
| ( h4 2( ":" h4 ) "::" 0*3( h4 ":" ) [ h4 ] ) / | ( h4 2( ":" h4 ) "::" 0*3( h4 ":" ) [ h4 ] ) / | |||
| ( h4 3( ":" h4 ) "::" 0*2( h4 ":" ) [ h4 ] ) / | ( h4 3( ":" h4 ) "::" 0*2( h4 ":" ) [ h4 ] ) / | |||
| ( h4 4( ":" h4 ) "::" 0*1( h4 ":" ) [ h4 ] ) / | ( h4 4( ":" h4 ) "::" 0*1( h4 ":" ) [ h4 ] ) / | |||
| ( 6( h4 ":" ) IPv4address )/ | ( 6( h4 ":" ) IPv4address )/ | |||
| ( "::" 0*5( h4 ":" ) IPv4address )/ | ( "::" 0*5( h4 ":" ) IPv4address )/ | |||
| ( h4 "::" 0*4( h4 ":" ) IPv4address )/ | ( h4 "::" 0*4( h4 ":" ) IPv4address )/ | |||
| ( h4 ":" h4 "::" 0*3( h4 ":" ) IPv4address )/ | ( h4 ":" h4 "::" 0*3( h4 ":" ) IPv4address )/ | |||
| ( h4 2( ":" h4 ) "::" 0*2( h4 ":" ) IPv4address )/ | ( h4 2( ":" h4 ) "::" 0*2( h4 ":" ) IPv4address )/ | |||
| ( h4 3( ":" h4 ) "::" 0*1( h4 ":" ) IPv4address ) | ( h4 3( ":" h4 ) "::" 0*1( h4 ":" ) IPv4address ) | |||
| h4 = 1*4HEXDIG | h4 = 1*4HEXDIG | |||
| reserved = "[" / "]" / ";" / "/" / "?" / | reserved = "[" / "]" / ";" / "/" / "?" / | |||
| ":" / "@" / "&" / "=" / "+" / "$" / "," / | ":" / "@" / "&" / "=" / "+" / "$" / "," | |||
| unreserved = ALPHA / DIGIT / mark | unreserved = ALPHA / DIGIT / mark | |||
| mark = "-" / "_" / "." / "!" / "~" / "*" / "'" / | mark = "-" / "_" / "." / "!" / "~" / "*" / "'" / | |||
| "(" / ")" | "(" / ")" | |||
| escaped = "%" HEXDIG HEXDIG | escaped = "%" HEXDIG HEXDIG | |||
| 2.3 IRI Equivalence and Normalization | 2.3 IRI Equivalence and Normalization | |||
| There is no general rule or procedure to decide whether two arbitrary | There is no general rule or procedure to decide whether two arbitrary | |||
| IRIs are equivalent or not (i.e. refer to the same resource or not). | IRIs are equivalent or not (i.e. refer to the same resource or not). | |||
| Two IRIs that look almost the same may refer to different resources. | Two IRIs that look almost the same may refer to different resources. | |||
| Two IRIs that look completely different may refer to, and resolve to, | Two IRIs that look completely different may refer to, and resolve to, | |||
| the same resource. | the same resource. | |||
| In some scenarios, such as XML Namespaces ([XMLNamespace]), a | In some scenarios a definite answer to the question of IRI | |||
| definite answer to the question of IRI equivalence is needed that is | equivalence is needed that is independent of the scheme used and | |||
| independent of the scheme used and always can be calculated quickly | always can be calculated quickly and without accessing a network. An | |||
| and without accessing a network. In such cases, two IRIs SHOULD be | example of such a case might be XML Namespaces ([XMLNamespace]). In | |||
| defined as equivalent if and only if they are character-by-character | such cases, two IRIs SHOULD be defined as equivalent if and only if | |||
| equivalent. This is the same as being byte-by-byte equivalent if the | they are character-by-character equivalent. This is the same as | |||
| character encoding for both IRIs is the same. As an example, | being byte-by-byte equivalent if the character encoding for both IRIs | |||
| is the same. As an example, | ||||
| http://example.org/~user, http://example.org/%7euser, and | http://example.org/~user, http://example.org/%7euser, and | |||
| http://example.org/%7Euser would not be equivalent. In such a case, | http://example.org/%7Euser would not be equivalent under this | |||
| the comparison function MUST NOT map the IRIs to URIs. | definition. In such a case, the comparison function MUST NOT map the | |||
| IRIs to URIs, because such a mapping would create something different | ||||
| under this equivalence relationship. | ||||
| It follows from the above that IRIs SHOULD NOT be modified when being | It follows from the above that IRIs SHOULD NOT be modified when being | |||
| transported. | transported. | |||
| For actual resolution, differences in escaping (except for the | For actual resolution, differences in escaping (except for the | |||
| escaping of reserved characters) MUST always result in the same | escaping of reserved characters) MUST always result in the same | |||
| resource. For example, http://example.org/~user, | resource. For example, http://example.org/~user, | |||
| http://example.org/%7euser and http://example.org/%7Euser must | http://example.org/%7euser and http://example.org/%7Euser must | |||
| resolve to the same resource. If this kind of equivalence is to be | resolve to the same resource. If this kind of equivalence is to be | |||
| tested, the escaping of both IRIs to be compared has to be aligned, | tested, the escaping of both IRIs to be compared has to be aligned, | |||
| skipping to change at page 11, line 34 | skipping to change at page 11, line 24 | |||
| escape is always the same. Such conversions MUST only be done on the | escape is always the same. Such conversions MUST only be done on the | |||
| fly, without changing the original IRI. | fly, without changing the original IRI. | |||
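The contrast between character-by-character equivalence and equivalence for resolution can be sketched in Python as follows. This is an illustration under stated assumptions, not the procedure of the draft (part of which is not shown here): the helper names are hypothetical, the alignment chosen is to unescape 'unreserved' US-ASCII characters and to upper-case the hexadecimal digits of all remaining escapes, and non-ASCII characters versus their escaped forms are not handled.

```python
import re

# The 'unreserved' characters of the ABNF: ALPHA / DIGIT / mark.
UNRESERVED = set(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-_.!~*'()"
)
_ESCAPE = re.compile(r"%([0-9A-Fa-f]{2})")


def align_escaping(iri):
    """Hypothetical helper: return a copy of iri with aligned escaping."""

    def replace(match):
        ch = chr(int(match.group(1), 16))
        if ch in UNRESERVED:
            return ch  # escaping an unreserved character carries no meaning
        # Reserved and other characters stay escaped; only the case is aligned.
        return "%" + match.group(1).upper()

    return _ESCAPE.sub(replace, iri)


def same_for_resolution(a, b):
    """Compare two IRIs after aligning their escaping on the fly."""
    return align_escaping(a) == align_escaping(b)
```

Plain string comparison (`a == b`) corresponds to the character-by-character definition, under which the three example IRIs differ. `same_for_resolution` treats them as the same, and it works on copies, so the original IRIs stay unmodified.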
| Specific schemes and resolution mechanisms may define additional | Specific schemes and resolution mechanisms may define additional | |||
| equivalences. For a specific scheme, two IRIs that e.g. differ only | equivalences. For a specific scheme, two IRIs that e.g. differ only | |||
| by case may be equivalent. However, this document does not deal with | by case may be equivalent. However, this document does not deal with | |||
| scheme-specific issues. | scheme-specific issues. | |||
| The Unicode Standard [UNIV3] defines various equivalences between | The Unicode Standard [UNIV3] defines various equivalences between | |||
| sequences of characters for various purposes. Unicode Standard Annex | sequences of characters for various purposes. Unicode Standard Annex | |||
| #15 [UNI15] defines various Normalization Forms for these | #15 [UTR15] defines various Normalization Forms for these | |||
| equivalences. IRIs SHOULD be created using Normalization Form C | equivalences. IRIs SHOULD be created using Normalization Form C | |||
| (NFC). Equivalence of IRIs MUST rely on the IRIs being appropriately | (NFC). Equivalence of IRIs MUST rely on the assumption that IRIs are | |||
| pre-normalized, rather than applying normalization, except when | appropriately pre-normalized, rather than applying normalization when | |||
| converting from a non-UCS-based encoding to an UCS-based encoding, | comparing two IRIs, except when converting from a non-UCS-based | |||
| where a normalizing transcoder using NFC MUST be used. | encoding to an UCS-based encoding, where a normalizing transcoder | |||
| using NFC MUST be used for interoperability. | ||||
| As an example, http://www.example.org/r&#xE9;sum&#xE9;.html (in XML | As an example, http://www.example.org/r&#xE9;sum&#xE9;.html (in XML | |||
| Notation) is in NFC. On the other hand, http://www.example.org/ | Notation) is in NFC. On the other hand, http://www.example.org/ | |||
| re&#x301;sume&#x301;.html is not in NFC. The former uses precombined | re&#x301;sume&#x301;.html is not in NFC. The former uses precombined | |||
| e-acute characters, the latter uses 'e' characters followed by | e-acute characters, the latter uses 'e' characters followed by | |||
| combining acute accents, both are defined as canonically equivalent | combining acute accents, both are defined as canonically equivalent | |||
| in [UNIV3]. | in [UNIV3]. | |||
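The NFC example can be reproduced with Python's standard `unicodedata` module. The sketch below is illustrative; the variable names and the helper `is_nfc` are hypothetical. It shows why comparison has to rely on pre-normalized IRIs: the two strings are canonically equivalent, yet differ character by character.

```python
import unicodedata

# Precombined e-acute (U+00E9) versus 'e' followed by a combining
# acute accent (U+0301), as in the two example IRIs.
precomposed = "http://www.example.org/r\u00e9sum\u00e9.html"
decomposed = "http://www.example.org/re\u0301sume\u0301.html"


def is_nfc(iri):
    """True if iri is already in Normalization Form C."""
    return unicodedata.normalize("NFC", iri) == iri


# Character-by-character comparison does not normalize, so these differ.
differ = precomposed != decomposed

# A normalizing transcoder would produce the precomposed form.
normalized = unicodedata.normalize("NFC", decomposed)
```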
| Various IRI schemes may allow the usage of International Domain Names | Various IRI schemes may allow the usage of International Domain Names | |||
| (IDN) [RFCXXXX]. When in use in IRIs, those names SHOULD be | (IDN) [RFCXXXX]. When in use in IRIs, those names SHOULD be | |||
| skipping to change at page 12, line 46 | skipping to change at page 12, line 37 | |||
| b) Interpretational: URIs identify resources in various ways. | b) Interpretational: URIs identify resources in various ways. | |||
| IRIs also identify resources. When the IRI is used simply for | IRIs also identify resources. When the IRI is used simply for | |||
| identification purposes, it is not necessary to map the IRI to | identification purposes, it is not necessary to map the IRI to | |||
| a URI (see Section 2.3). However, when an IRI is used for | a URI (see Section 2.3). However, when an IRI is used for | |||
| resource retrieval, the resource that the IRI locates is the | resource retrieval, the resource that the IRI locates is the | |||
| same as the one located by the URI obtained after converting | same as the one located by the URI obtained after converting | |||
| the IRI according to the procedure defined here. This means | the IRI according to the procedure defined here. This means | |||
| that there is no need to define resolution separately on the | that there is no need to define resolution separately on the | |||
| IRI level. | IRI level. | |||
| This mapping is accomplished in two steps. | Applications MUST map IRIs to URIs using the following two steps. | |||
| Step 1) This step generates a UCS-based encoding from the original | Step 1) This step generates a UCS-based encoding from the original | |||
| IRI format. This step has three variants, depending on the | IRI format. This step has three variants, depending on the | |||
| form of the input. | form of the input. | |||
| Variant A) If the IRI is written on paper or read out loud, | Variant A) If the IRI is written on paper or read out loud, | |||
| or otherwise represented as a sequence of characters | or otherwise represented as a sequence of characters | |||
| independent of any encoding: Represent the IRI as a | independent of any encoding: Represent the IRI as a | |||
| sequence of characters from the UCS normalized according | sequence of characters from the UCS normalized according | |||
| to Normalization Form C (NFC, [UNI15]). | to Normalization Form C (NFC, [UTR15]). | |||
| Variant B) If the IRI is in some digital representation | Variant B) If the IRI is in some digital representation | |||
| (e.g. an octet stream) in some non-Unicode encoding: | (e.g. an octet stream) in some non-Unicode encoding: | |||
| Convert the IRI to a sequence of characters from the UCS | Convert the IRI to a sequence of characters from the UCS | |||
| normalized according to NFC. | normalized according to NFC. | |||
| Variant C) If the IRI is in a Unicode-based encoding (for | Variant C) If the IRI is in a Unicode-based encoding (for | |||
| example UTF-8 or UTF-16): Do not normalize. Move | example UTF-8 or UTF-16): Do not normalize. Move | |||
| directly to Step 2. | directly to Step 2. | |||
| Step 2) For each character that is disallowed in URI references, | Step 2) For each character that is disallowed in URI references, | |||
| apply steps 1) through 3) below. The disallowed characters | apply steps 1) through 3) below. The disallowed characters | |||
| consist of all non-ASCII characters, plus the excluded | consist of all non-ASCII characters, plus the excluded | |||
| characters listed in Section 2.4 of [RFC2396], except for the | characters listed in Section 2.4 of [RFC2396], except for the | |||
| number sign (#) and percent sign (%) and the square bracket | number sign (#) and percent sign (%) and the square bracket | |||
| characters re-allowed in [RFC2732]. | characters re-allowed in [RFC2732]. | |||
| 1) Convert the character to a sequence of one or more octets | 1) Convert the character to a sequence of one or more octets | |||
| using UTF-8 [RFC2279]. | using UTF-8 [RFC2279]. | |||
| 2) Convert each octet to %hh, where hh is the hexadecimal | 2) Convert each octet to %HH, where HH is the hexadecimal | |||
| notation of the octet value. Note: This is identical to | notation of the octet value. Note: This is identical to | |||
| the escaping mechanism in Section 2.4.1 of [RFC2396]. | the escaping mechanism in Section 2.4.1 of [RFC2396]. | |||
| Note: To reduce variability, the hexadecimal notation | Note: To reduce variability, the hexadecimal notation | |||
| SHOULD use lower case letters. | SHOULD use upper case letters. | |||
| 3) Replace the original character by the resulting character | 3) Replace the original character by the resulting character | |||
| sequence. | sequence (i.e. a sequence of %HH triplets). | |||
| Note that in this process (in step 2.3), characters allowed in URI | Note that in this process (in step 2.3), characters allowed in URI | |||
| references and existing escape sequences are not escaped further. | references and existing escape sequences are not escaped further. | |||
| (This mapping is similar to, but different from, the escaping applied | (This mapping is similar to, but different from, the escaping applied | |||
| when including arbitrary content into some part of a URI.) For | when including arbitrary content into some part of a URI.) For | |||
| example, an IRI of | example, an IRI of | |||
| http://www.example.org/red%09ros&#xE9;#&lt;red&gt; (in XML notation) is | http://www.example.org/red%09ros&#xE9;#&lt;red&gt; (in XML notation) is | |||
| converted to | converted to | |||
| http://www.example.org/red%09ros%c3%a9#%3cred%3e, not to something | http://www.example.org/red%09ros%C3%A9#%3Cred%3E, not to something | |||
| like | like | |||
| http%3a%2f%2fwww.example.org%2fred%2509ros%c3%a9%23red. | http%3A%2F%2Fwww.example.org%2Fred%2509ros%C3%A9%23red. | |||
| Note that some older software transcoding to UTF-8 may produce | Note that some older software transcoding to UTF-8 may produce | |||
| illegal output for some input, in particular for characters outside | illegal output for some input, in particular for characters outside | |||
| the BMP (Basic Multilingual Plane). As an example, for the following | the BMP (Basic Multilingual Plane). As an example, for the following | |||
| IRI with non-BMP characters (in XML Notation): | IRI with non-BMP characters (in XML Notation): | |||
| http://example.com/&#x10300;&#x10301;&#x10302; | http://example.com/&#x10300;&#x10301;&#x10302; | |||
| (the first three letters of the Old Italic alphabet) the correct | (the first three letters of the Old Italic alphabet) the correct | |||
| conversion to a URI is: | conversion to a URI is: | |||
| http://example.com/%F0%90%8C%80%F0%90%8C%81%F0%90%8C%82 | http://example.com/%F0%90%8C%80%F0%90%8C%81%F0%90%8C%82 | |||
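
Step 2 of the IRI-to-URI mapping can be sketched as follows. This is an illustration only, not a reference implementation: the function name iri_to_uri and the constant EXCLUDED_ASCII are hypothetical, the input is assumed to be a Python string that has already passed Step 1, and the excluded set is a reading of Section 2.4 of [RFC2396] (control, space, delims, unwise) minus '#', '%', and the square brackets. Upper case hex digits are used, following the -03 text.

```python
# ASCII characters escaped by this sketch: control characters, space,
# and the "delims"/"unwise" characters of RFC 2396, except '#', '%',
# '[' and ']', which the mapping leaves untouched.
EXCLUDED_ASCII = (
    set('<>"{}|\\^`')
    | {" ", "\x7f"}
    | {chr(code) for code in range(0x20)}
)


def iri_to_uri(iri):
    """Escape every character that is disallowed in URI references."""
    pieces = []
    for ch in iri:
        if ord(ch) > 0x7F or ch in EXCLUDED_ASCII:
            # 1) convert to UTF-8 octets, 2) write each octet as %HH,
            # 3) substitute the triplets for the original character.
            pieces.append("".join("%%%02X" % octet for octet in ch.encode("utf-8")))
        else:
            # Allowed characters and existing escapes pass through as-is.
            pieces.append(ch)
    return "".join(pieces)


print(iri_to_uri("http://www.example.org/red%09ros\u00e9#<red>"))
print(iri_to_uri("http://example.com/\U00010300\U00010301\U00010302"))
```

Because '%' is never escaped, existing escape sequences such as %09 survive unchanged, which is the property the text above calls out. Characters outside the BMP are encoded as four UTF-8 octets, not as two escaped surrogates.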
| skipping to change at page 14, line 46 | skipping to change at page 14, line 37 | |||
| a) Some escape sequences are necessary to distinguish escaped and | a) Some escape sequences are necessary to distinguish escaped and | |||
| unescaped uses of reserved characters. | unescaped uses of reserved characters. | |||
| b) Some escape sequences cannot be interpreted as sequences of | b) Some escape sequences cannot be interpreted as sequences of | |||
| UTF-8 octets. | UTF-8 octets. | |||
| (Note: Due to the regularities in the octet patterns of UTF-8, | (Note: Due to the regularities in the octet patterns of UTF-8, | |||
| there is a very high probability, but no guarantee, that escape | there is a very high probability, but no guarantee, that escape | |||
| sequences that can be interpreted as sequences of UTF-8 octets | sequences that can be interpreted as sequences of UTF-8 octets | |||
| actually originated from UTF-8. For a detailed discussion, see | actually originated from UTF-8. For a detailed discussion, see | |||
| [Duer97].) | [Duerst97].) | |||
| c) The conversion may result in a character that is not | c) The conversion may result in a character that is not | |||
| appropriate in an IRI. See Section 5.1 for further details. | appropriate in an IRI. See Section 5.1 for further details. | |||
| Conversion from a URI to an IRI is done using the following steps (or | Conversion from a URI to an IRI is done using the following steps (or | |||
| any other algorithm that produces the same result): | any other algorithm that produces the same result): | |||
| 1) Represent the URI as a sequence of octets in US-ASCII. | 1) Represent the URI as a sequence of octets in US-ASCII. | |||
| 2) Convert all hexadecimal escapes (% followed by two hexadecimal | 2) Convert all hexadecimal escapes (% followed by two hexadecimal | |||
| digits) except those corresponding to '#' and '%' and | digits) except those corresponding to '#' and '%' and | |||
| characters in 'reserved', to the corresponding octets. | characters in 'reserved', to the corresponding octets. | |||
| 3) Re-escape any octets that are not part of a strictly legal UTF- | 3) Re-escape any octet produced in step 2) that is not part of a | |||
| 8 octet sequence. | strictly legal UTF-8 octet sequence. | |||
| 4) Re-escape all octets that in UTF-8 represent characters that | 4) Re-escape all octets produced in step 2) that in UTF-8 | |||
| are not appropriate according to Section 5.1. | represent characters that are not appropriate according to | |||
| Section 4.1 and Section 5.1. | ||||
| 5) Interpret the resulting octet sequence as a sequence of | 5) Interpret the resulting octet sequence as a sequence of | |||
| characters encoded in UTF-8. | characters encoded in UTF-8. | |||
| This procedure will convert as many escaped non-ASCII characters as | This procedure will convert as many escaped non-ASCII characters as | |||
| possible to characters in an IRI. Because there are some choices | possible to characters in an IRI. Because there are some choices | |||
| when applying step 4) (see Section 5.1), results may differ. | when applying step 4) (see Section 5.1), results may differ. | |||
| Conversions from URIs to IRIs MUST NOT use any other encoding than | ||||
| UTF-8 in steps 3) and 4) above, even if it might be possible from | ||||
| context to guess that another encoding than UTF-8 was used in the | ||||
| URI. As an example, the URI http://www.example.org/r%E9sum%E9.html, | ||||
| which with some guesses might be interpreted to contain two e-acute | ||||
| characters encoded as iso-8859-1, must not be converted to an IRI | ||||
| containing these e-acute characters. Otherwise, the IRI will in the | ||||
| future be mapped to http://www.example.org/r%C3%A9sum%C3%A9.html, | ||||
| which is a different URI from http://www.example.org/r%E9sum%E9.html. | ||||
| 3.2.1 Examples | ||||
| This section shows various examples of converting URIs to IRIs. The | ||||
| notation <hh> is used to denote octets outside those that can be | ||||
| represented in this document. Each example shows the result after | ||||
| applying each of the steps 1) to 5). XML Notation is used for the | ||||
| final result. | ||||
| The following example contains the sequence '%C3%BC', which is a | ||||
| strictly legal UTF-8 sequence, and which is converted into the actual | ||||
| character U+00FC LATIN SMALL LETTER U WITH DIAERESIS (also known as | ||||
| u-umlaut). | ||||
| 1) http://www.example.org/D%C3%BCrst | ||||
| 2) http://www.example.org/D<c3><bc>rst | ||||
| 3) http://www.example.org/D<c3><bc>rst | ||||
| 4) http://www.example.org/D<c3><bc>rst | ||||
| 5) http://www.example.org/D&#xFC;rst | ||||
| The following example contains the sequence '%FC', which might | ||||
| represent U+00FC LATIN SMALL LETTER U WITH DIAERESIS in the iso-8859- | ||||
| 1 encoding. (It might represent other characters in other encodings. | ||||
| For example, the octet <FC> in iso-8859-5 represents U+045C CYRILLIC | ||||
| SMALL LETTER KJE.) Because <FC> is not part of a strictly legal UTF-8 | ||||
| sequence, it is re-escaped in step 3). | ||||
| 1) http://www.example.org/D%FCrst | ||||
| 2) http://www.example.org/D<FC>rst | ||||
| 3) http://www.example.org/D%FCrst | ||||
| 4) http://www.example.org/D%FCrst | ||||
| 5) http://www.example.org/D%FCrst | ||||
| The following example contains '%e2%80%ae', which is the escaped UTF- | ||||
| 8 encoding of U+202E, RIGHT-TO-LEFT OVERRIDE. Section 4.1 forbids | ||||
| the direct use of this character in an IRI. Therefore, the | ||||
| corresponding octets are re-escaped in step 4). This example shows | ||||
| that the case (upper or lower) of letters used in escapes may not be | ||||
| preserved. | ||||
| 1) http://www.example.org/%e2%80%ae | ||||
| 2) http://www.example.org/<E2><80><AE> | ||||
| 3) http://www.example.org/<E2><80><AE> | ||||
| 4) http://www.example.org/%E2%80%AE | ||||
| 5) http://www.example.org/%E2%80%AE | ||||
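
The three examples above can be reproduced with a short sketch of the URI-to-IRI steps. Several simplifications are assumptions of this sketch and not statements of the draft: only escapes of octets 0x80 and above are unescaped (escaped ASCII, including '#', '%' and reserved characters, is left alone), the set FORBIDDEN is a small stand-in for the characters excluded by Section 4.1 and Section 5.1, and the names uri_to_iri, FORBIDDEN and NON_ASCII_ESCAPES are hypothetical. Python's strict UTF-8 decoder with the surrogateescape error handler is used to mark octets that are not part of a legal UTF-8 sequence.

```python
import re

# Stand-in for characters that must not appear directly in an IRI
# (here: a few bidi formatting characters). An assumption, not the full list.
FORBIDDEN = {0x200E, 0x200F, 0x202A, 0x202B, 0x202C, 0x202D, 0x202E}

# Maximal runs of %HH escapes whose octet value is 0x80 or above.
NON_ASCII_ESCAPES = re.compile(r"(?:%[89A-Fa-f][0-9A-Fa-f])+")


def _convert_run(match):
    # Step 2: turn the escapes into octets.
    octets = bytes.fromhex(match.group(0).replace("%", ""))
    pieces = []
    # Octets that are not legal UTF-8 come back as U+DC80..U+DCFF.
    for ch in octets.decode("utf-8", errors="surrogateescape"):
        code = ord(ch)
        if 0xDC80 <= code <= 0xDCFF:
            # Step 3: re-escape octets outside legal UTF-8 sequences.
            pieces.append("%%%02X" % (code - 0xDC00))
        elif code in FORBIDDEN:
            # Step 4: re-escape characters not appropriate in an IRI.
            pieces.append("".join("%%%02X" % octet for octet in ch.encode("utf-8")))
        else:
            # Step 5: keep the decoded character.
            pieces.append(ch)
    return "".join(pieces)


def uri_to_iri(uri):
    """Convert escaped UTF-8 in a URI back to characters where possible."""
    return NON_ASCII_ESCAPES.sub(_convert_run, uri)


print(uri_to_iri("http://www.example.org/D%C3%BCrst"))
print(uri_to_iri("http://www.example.org/D%FCrst"))
print(uri_to_iri("http://www.example.org/%e2%80%ae"))
```

No encoding other than UTF-8 is ever tried, so a URI such as http://www.example.org/r%E9sum%E9.html is returned unchanged, in line with the MUST NOT above. Re-escaped octets come out in upper case, which is why the case of the third example is not preserved.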
| 4. Bidirectional IRIs for Right-to-left Languages | 4. Bidirectional IRIs for Right-to-left Languages | |||
| Some UCS characters, such as those used in the Arabic and Hebrew | Some UCS characters, such as those used in the Arabic and Hebrew | |||
| script, have an inherent right-to-left writing direction. IRIs | script, have an inherent right-to-left writing direction. IRIs | |||
| containing such characters (called bidirectional IRIs or Bidi IRIs) | containing such characters (called bidirectional IRIs or Bidi IRIs) | |||
| require additional attention because of the non-trivial relation | require additional attention because of the non-trivial relation | |||
| between logical representation (used for digital representation as | between logical representation (used for digital representation as | |||
| well as when reading/spelling) and visual representation (used for | well as when reading/spelling) and visual representation (used for | |||
| display/printing). | display/printing). | |||
| skipping to change at page 24, line 6 | skipping to change at page 25, line 18 | |||
| Outside of the US-ASCII range, there are many more opportunities for | Outside of the US-ASCII range, there are many more opportunities for | |||
| confusion; a complete set of guidelines is too lengthy to include | confusion; a complete set of guidelines is too lengthy to include | |||
| here. As long as names are limited to characters from a single | here. As long as names are limited to characters from a single | |||
| script, native writers of a given script or language will know best | script, native writers of a given script or language will know best | |||
| when ambiguities can appear, and how they can be avoided. What may | when ambiguities can appear, and how they can be avoided. What may | |||
| look ambiguous to a stranger may be completely obvious to the average | look ambiguous to a stranger may be completely obvious to the average | |||
| native user. On the other hand, in some cases, the UCS contains | native user. On the other hand, in some cases, the UCS contains | |||
| variants for compatibility reasons, for example for typographic | variants for compatibility reasons, for example for typographic | |||
| purposes. These should be avoided wherever possible. Although there | purposes. These should be avoided wherever possible. Although there | |||
| may be exceptions, in general newly created resource names should be | may be exceptions, in general newly created resource names should be | |||
| in NFKC [UNI15] (which means that they are also in NFC). | in NFKC [UTR15] (which means that they are also in NFC). | |||
| As an example, the UCS contains a codepoint for the 'fi' ligature. | As an example, the UCS contains codepoint U+FB01 for the 'fi' | |||
| Wherever possible, IRIs should use the two letters 'f' and 'i' rather | ligature for compatibility reasons. Wherever possible, IRIs should | |||
| than the 'fi' ligature. An example where the latter may be used is in | use the two letters 'f' and 'i' rather than the 'fi' ligature. An | |||
| the query part of an IRI for an explicit search for a word containing | example where the latter may be used is in the query part of an IRI | |||
| the 'fi' ligature. | for an explicit search for a word containing the 'fi' ligature. | |||
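
The effect of NFKC on the ligature can be illustrated with a minimal sketch, again assuming Python's unicodedata module; the sample name is invented for the illustration.

```python
import unicodedata

# U+FB01 LATIN SMALL LIGATURE FI followed by "le".
with_ligature = "\ufb01le"

# NFC keeps the compatibility character; NFKC replaces it with 'f' + 'i'.
nfc_form = unicodedata.normalize("NFC", with_ligature)
nfkc_form = unicodedata.normalize("NFKC", with_ligature)

print(nfc_form == with_ligature)
print(nfkc_form == "file")
```

This shows why a name in NFKC is also in NFC, while the reverse does not hold.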
| In certain cases, there is a chance that characters from different | In certain cases, there is a chance that characters from different | |||
| scripts look the same. The best known example is the Latin 'A', the | scripts look the same. The best known example is the Latin 'A', the | |||
| Greek 'Alpha', and the Cyrillic 'A'. To avoid such cases, only IRIs | Greek 'Alpha', and the Cyrillic 'A'. To avoid such cases, only IRIs | |||
| should be generated where all the characters in a single component | should be generated where all the characters in a single component | |||
| are used together in a given language. This usually means that all | are used together in a given language. This usually means that all | |||
| these characters will be from the same script, but there are | these characters will be from the same script, but there are | |||
| languages that mix characters from different scripts (such as | languages that mix characters from different scripts (such as | |||
| Japanese). This is similar to the heuristics used to distinguish | Japanese). This is similar to the heuristics used to distinguish | |||
| between letters and numbers in the examples above. Also, for Latin, | between letters and numbers in the examples above. Also, for Latin, | |||
| skipping to change at page 25, line 17 | skipping to change at page 26, line 30 | |||
| how currently some servers treat URIs as case-insensitive, or perform | how currently some servers treat URIs as case-insensitive, or perform | |||
| additional matching to account for spelling errors. For characters | additional matching to account for spelling errors. For characters | |||
| beyond the ASCII repertoire, this may for example include ignoring | beyond the ASCII repertoire, this may for example include ignoring | |||
| the accents on received IRIs or resource names where appropriate. | the accents on received IRIs or resource names where appropriate. | |||
| Please note that such mappings, including case mappings, are | Please note that such mappings, including case mappings, are | |||
| language-dependent. | language-dependent. | |||
| It can be difficult to unambiguously identify a resource if too many | It can be difficult to unambiguously identify a resource if too many | |||
| mappings are taken into consideration. However, escaped and non- | mappings are taken into consideration. However, escaped and non- | |||
| escaped parts of IRIs can always clearly be distinguished. Also, the | escaped parts of IRIs can always clearly be distinguished. Also, the | |||
| regularity of UTF-8 (see [Duer97]) makes the potential for collisions | regularity of UTF-8 (see [Duerst97]) makes the potential for | |||
| lower than it may seem at first sight. | collisions lower than it may seem at first sight. | |||
| 6.8 Upgrading Strategy | 6.8 Upgrading Strategy | |||
| Where this recommendation places further constraints on software for | Where this recommendation places further constraints on software for | |||
| which many instances are already deployed, it is important to | which many instances are already deployed, it is important to | |||
| introduce upgrades carefully, and to be aware of the various | introduce upgrades carefully, and to be aware of the various | |||
| interdependencies. | interdependencies. | |||
| If IRIs cannot be interpreted correctly, they should not be generated | If IRIs cannot be interpreted correctly, they should not be generated | |||
| or transported. This suggests that upgrading URI interpreting | or transported. This suggests that upgrading URI interpreting | |||
| skipping to change at page 26, line 32 | skipping to change at page 27, line 45 | |||
| similar, but may contain all kinds of changes that may be difficult | similar, but may contain all kinds of changes that may be difficult | |||
| to spot but can cause all kinds of problems. Most spoofing | to spot but can cause all kinds of problems. Most spoofing | |||
| possibilities for IRIs are extensions of those for URIs. | possibilities for IRIs are extensions of those for URIs. | |||
| Spoofing can occur for various reasons. A first reason is that | Spoofing can occur for various reasons. A first reason is that | |||
| normalization expectations of a user or actual normalization when | normalization expectations of a user or actual normalization when | |||
| entering an IRI do not match the normalization used on the server | entering an IRI do not match the normalization used on the server | |||
| side. Conceptually, this is no different from the problems | side. Conceptually, this is no different from the problems | |||
| surrounding the use of case-insensitive web servers. For example, a | surrounding the use of case-insensitive web servers. For example, a | |||
| popular web page with a mixed case name (http://big.site/ | popular web page with a mixed case name (http://big.site/ | |||
| PopularPage.html) might be "spoofed" by someone who obtains access to | PopularPage.html) might be "spoofed" by someone who is able to create | |||
| http://big.site/popularpage.html. However, the introduction of | http://big.site/popularpage.html. However, the introduction of | |||
| character normalization, and of additional mappings for user | character normalization, and of additional mappings for user | |||
| convenience, may increase the chance for spoofing. | convenience, may increase the chance for spoofing. | |||
| Spoofing can occur because in the UCS, there are many characters that | Spoofing can occur because in the UCS, there are many characters that | |||
| look very similar. Details are discussed in Section 6.5. Again, | look very similar. Details are discussed in Section 6.5. Again, | |||
| this is very similar to spoofing possibilities on US-ASCII, e.g. | this is very similar to spoofing possibilities on US-ASCII, e.g. | |||
| using 'br0ken' or '1ame' URIs. | using 'br0ken' or '1ame' URIs. | |||
| Spoofing can occur when URIs in various encodings are accepted to | Spoofing can occur when URIs in various encodings are accepted to | |||
| deal with older user agents. In some cases, in particular for Latin- | deal with older user agents. In some cases, in particular for Latin- | |||
| based resource names, this is usually easy to detect because UTF-8- | based resource names, this is usually easy to detect because UTF-8- | |||
| encoded names, when interpreted and viewed as legacy encodings, | encoded names, when interpreted and viewed as legacy encodings, | |||
| produce mostly garbage. In other cases, when concurrently used | produce mostly garbage. In other cases, when concurrently used | |||
| encodings have a similar structure, but there are no characters that | encodings have a similar structure, but there are no characters that | |||
| have exactly the same encoding, detection is more difficult. | have exactly the same encoding, detection is more difficult. | |||
| skipping to change at page 27, line 14 | skipping to change at page 28, line 27 | |||
| part, see [Nameprep]. For the path part, administrators of sites | part, see [Nameprep]. For the path part, administrators of sites | |||
| which allow independent users to create resources in the same subarea | which allow independent users to create resources in the same subarea | |||
| may need to be careful to check for spoofing. | may need to be careful to check for spoofing. | |||
| Spoofing can occur with bidirectional IRIs, if the restrictions in | Spoofing can occur with bidirectional IRIs, if the restrictions in | |||
| Section 4.2 are not followed. The same visual representation may be | Section 4.2 are not followed. The same visual representation may be | |||
| interpreted as different logical representations, and vice versa. It | interpreted as different logical representations, and vice versa. It | |||
| is also very important that a correct Unicode bidirectional | is also very important that a correct Unicode bidirectional | |||
| implementation is used. | implementation is used. | |||
| 8. Change log | 8. Issues List | |||
| 8.1 Changes from -01 to -02 | - Should characters in iadditional be allowed? Under what | |||
| conditions? | ||||
| - Align the description in Section 2.3 with the results of W3C | ||||
| TAG discussions on issue URIEquivalence. | ||||
| - Adapt depending on how [IDNURI] is integrated into | ||||
| [RFC2396bis]. | ||||
| 9. Change log | ||||
| 9.1 Changes from -02 to -03 | ||||
| - Added an issues list. | ||||
| - Added a paragraph prohibiting conversions from URIs to IRIs not | ||||
| based on UTF-8 to Section 3.2. | ||||
| - Introduced iadditional to combine unwise, delims, and space. | ||||
| - Tweaked description and added examples for URI-to-IRI | ||||
| conversion. | ||||
| - Improved syntax rules for hostname part. | ||||
| - Improved description of equivalences in Section 2.3. | ||||
| - Improved description of URI-to-IRI-mapping in Section 3.2. | ||||
| - Changed preferred case when hex-escaping from lower to UPPER. | ||||
| - Fixed various details. | ||||
| 9.2 Changes from -01 to -02 | ||||
| - New approach for Bidi section, many examples. | - New approach for Bidi section, many examples. | |||
| - Created idelims, removed '%' and '#'. Changed userinfo to | - Created idelims, removed '%' and '#'. Changed userinfo to | |||
| iuserinfo in iserver. | iuserinfo in iserver. | |||
| - Changed to ABNF defined by [RFC2234]. | - Changed to ABNF defined by [RFC2234]. | |||
| - Included bug fixes from [RFC2396bis]. | - Included bug fixes from [RFC2396bis]. | |||
| - Additions to Acknowledgements. | - Additions to Acknowledgements. | |||
| 8.2 Changes from -00 to -01 | 9.3 Changes from -00 to -01 | |||
| - Re-integrated the section on Bidi, some issues left. | - Re-integrated the section on Bidi, some issues left. | |||
| - Integrated IDN, changed syntax (host, userinfo,....). | - Integrated IDN, changed syntax (host, userinfo,....). | |||
| - Moved some text around, marked some as informational. | - Moved some text around, marked some as informational. | |||
| - Made a clear distinction of IRI use for identification only and | - Made a clear distinction of IRI use for identification only and | |||
| for resource resolution. | for resource resolution. | |||
| - Fixed various details in wording, spelling,... | - Fixed various details in wording, spelling,... | |||
| 9. Acknowledgements | 10. Acknowledgements | |||
| We would like to thank Larry Masinter for his work as coauthor of | We would like to thank Larry Masinter for his work as coauthor of | |||
| many earlier versions of this document (draft-masinter-url-i18n-xx). | many earlier versions of this document (draft-masinter-url-i18n-xx). | |||
| The discussion on the issue addressed here has started a long time | The discussion on the issue addressed here has started a long time | |||
| ago. There was a thread in the HTML working group in August 1995 | ago. There was a thread in the HTML working group in August 1995 | |||
| (under the topic of "Globalizing URIs") and in the www-international | (under the topic of "Globalizing URIs") and in the www-international | |||
| mailing list in July 1996 (under the topic of "Internationalization | mailing list in July 1996 (under the topic of "Internationalization | |||
| and URLs"), and ad-hoc meetings at the Unicode conferences in | and URLs"), and ad-hoc meetings at the Unicode conferences in | |||
| September 1995 and September 1997. | September 1995 and September 1997. | |||
| Thanks to Francois Yergeau, Matti Allouche, Roy Fielding, Tim | Thanks to Francois Yergeau, Matti Allouche, Roy Fielding, Tim | |||
| Berners-Lee, Mark Davis, M.T. Carrasco Benitez, James Clark, Tim | Berners-Lee, Mark Davis, M.T. Carrasco Benitez, James Clark, Tim | |||
| Bray, Chris Wendt, Yaron Goland, Andrea Vine, Misha Wolf, Leslie | Bray, Chris Wendt, Yaron Goland, Andrea Vine, Misha Wolf, Leslie | |||
| Daigle, Ted Hardie, Makoto MURATA, Steven Atkin, Ryan Stansifer, Tex | Daigle, Ted Hardie, Makoto MURATA, Steven Atkin, Ryan Stansifer, Tex | |||
| Texin, Graham Klyne, Bjoern Hoehrmann, Chris Lilly, Dan Oscarson, | Texin, Graham Klyne, Bjoern Hoehrmann, Chris Lilley, Dan Oscarson, | |||
| Elliotte Rusty Harold, Mike J. Brown, Carlos Viegas Damasio, and | Elliotte Rusty Harold, Mike J. Brown, Carlos Viegas Damasio, and | |||
| many others for help with understanding the issues and possible | many others for help with understanding the issues and possible | |||
| solutions, and getting the details right. Thanks also to the members | solutions, and getting the details right. Thanks also to the members | |||
| of the W3C I18N Working Group and Interest Group for their | of the W3C I18N Working Group and Interest Group for their | |||
| contributions and their work on [CharMod], to the members of many | contributions and their work on [CharMod], to the members of many | |||
| other W3C WGs for adopting the ideas, and to the members of the | other W3C WGs for adopting the ideas, and to the members of the | |||
| Montreal IAB Workshop on Internationalization and Localization for | Montreal IAB Workshop on Internationalization and Localization for | |||
| their review. | their review. | |||
| Normative References | Normative References | |||
| skipping to change at page 28, line 50 | skipping to change at page 30, line 49 | |||
| [RFC2732] Hinden, R., Carpenter, B. and L. Masinter, "Format for | [RFC2732] Hinden, R., Carpenter, B. and L. Masinter, "Format for | |||
| Literal IPv6 Addresses in URL's", RFC 2732, December | Literal IPv6 Addresses in URL's", RFC 2732, December | |||
| 1999. | 1999. | |||
| [RFCXXXX] Faltstrom, P., Hoffman, P. and A. Costello, | [RFCXXXX] Faltstrom, P., Hoffman, P. and A. Costello, | |||
| "Internationalizing Domain Names in Applications (IDNA)", | "Internationalizing Domain Names in Applications (IDNA)", | |||
| draft-ietf-idn-idna-14.txt (work in progress), October | draft-ietf-idn-idna-14.txt (work in progress), October | |||
| 2002, <http://www.ietf.org/internet-drafts/draft-ietf- | 2002, <http://www.ietf.org/internet-drafts/draft-ietf- | |||
| idn-idna-14.txt>. | idn-idna-14.txt>. | |||
| [UNI15] Davis, M. and M. Duerst, "Unicode Normalization Forms", | [UTR15] Davis, M. and M. Duerst, "Unicode Normalization Forms", | |||
| Unicode Standard Annex #15, March 2001, <http:// | Unicode Standard Annex #15, March 2001, <http:// | |||
| www.unicode.org/unicode/reports/tr15/tr15-21.html>. | www.unicode.org/unicode/reports/tr15/tr15-21.html>. | |||
| Non-normative References | Non-normative References | |||
| [BidiEx] "Examples of bidirectional IRIs", <http://www.w3.org/ | [BidiEx] "Examples of bidirectional IRIs", <http://www.w3.org/ | |||
| International/iri-edit/BidiExamples>. | International/iri-edit/BidiExamples>. | |||
| [CharMod] Duerst, M., Yergeau, F., Ishida, R., Wolf, M., | [CharMod] Duerst, M., Yergeau, F., Ishida, R., Wolf, M., | |||
| Freytag, A. and T. Texin, "Character Model for the | Freytag, A. and T. Texin, "Character Model for the | |||
| World Wide Web", World Wide Web Consortium Working | World Wide Web", World Wide Web Consortium Working | |||
| Draft, April 2002, <http://www.w3.org/TR/charmod>. | Draft, April 2002, <http://www.w3.org/TR/charmod>. | |||
| [Duer97] Duerst, M., "The Properties and Promises of UTF-8", | [Duerst97] Duerst, M., "The Properties and Promises of UTF-8", | |||
| Proc. 11th International Unicode Conference, San Jose, | Proc. 11th International Unicode Conference, San Jose, | |||
| September 1997, <http://www.ifi.unizh.ch/mml/ | September 1997, <http://www.ifi.unizh.ch/mml/ | |||
| mduerst/papers/PDF/IUC11-UTF-8.pdf>. | mduerst/papers/PDF/IUC11-UTF-8.pdf>. | |||
| [Duer01] Duerst, M., "Internationalized Resource Identifiers: | [Duerst01] Duerst, M., "Internationalized Resource Identifiers: | |||
| From Specification to Testing", Proc. 19th | From Specification to Testing", Proc. 19th | |||
| International Unicode Conference, San Jose, | International Unicode Conference, San Jose, | |||
| September 2001, <http://www.w3.org/2001/Talks/0912- | September 2001, <http://www.w3.org/2001/Talks/0912- | |||
| IUC-IRI/paper.html>. | IUC-IRI/paper.html>. | |||
| [HTML4] Raggett, D., Le Hors, A. and I. Jacobs, "HTML 4.01 | [HTML4] Raggett, D., Le Hors, A. and I. Jacobs, "HTML 4.01 | |||
| Specification", World Wide Web Consortium | Specification", World Wide Web Consortium | |||
| Recommendation, December 1999, <http://www.w3.org/TR/ | Recommendation, December 1999, <http://www.w3.org/TR/ | |||
| REC-html40/appendix/notes.html#h-B.2>. | REC-html40/appendix/notes.html#h-B.2>. | |||
| [IDNURI] Duerst, M., "Internationalized Domain Names in URIs", | [IDNURI] Duerst, M., "Internationalized Domain Names in URIs", | |||
| draft-ietf-idn-uri-03.txt (work in progress), July | draft-ietf-idn-uri-03.txt (work in progress), | |||
| 2002, <http://www.ietf.org/internet-drafts/draft- | November 2002, <http://www.ietf.org/internet-drafts/ | |||
| ietf-idn-uri-03.txt>. | draft-ietf-idn-uri-03.txt>. | |||
| [Nameprep] Hoffman, P. and M. Blanchet, "Nameprep: A Stringprep | [Nameprep] Hoffman, P. and M. Blanchet, "Nameprep: A Stringprep | |||
| Profile for Internationalized Domain Names", draft- | Profile for Internationalized Domain Names", draft- | |||
| ietf-idn-nameprep-11.txt (work in progress), June | ietf-idn-nameprep-11.txt (work in progress), June | |||
| 2002, <http://www.ietf.org/internet-drafts/draft- | 2002, <http://www.ietf.org/internet-drafts/draft- | |||
| ietf-idn-nameprep-11.txt>. | ietf-idn-nameprep-11.txt>. | |||
| [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate | [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate | |||
| Requirement Levels", BCP 14, RFC 2119, March 1997. | Requirement Levels", BCP 14, RFC 2119, March 1997. | |||
| skipping to change at page 31, line 23 | skipping to change at page 33, line 23 | |||
| XML", World Wide Web Consortium Recommendation, | XML", World Wide Web Consortium Recommendation, | |||
| January 1999, <http://www.w3.org/TR/REC-xml#sec- | January 1999, <http://www.w3.org/TR/REC-xml#sec- | |||
| external-ent>. | external-ent>. | |||
| [XMLSchema] Biron, P. and A. Malhotra, "XML Schema Part 2: | [XMLSchema] Biron, P. and A. Malhotra, "XML Schema Part 2: | |||
| Datatypes", World Wide Web Consortium Recommendation, | Datatypes", World Wide Web Consortium Recommendation, | |||
| May 2001, <http://www.w3.org/TR/xmlschema-2/#anyURI>. | May 2001, <http://www.w3.org/TR/xmlschema-2/#anyURI>. | |||
| Authors' Addresses | Authors' Addresses | |||
| Martin Duerst (Note: Please write "Duerst" with u-umlaut wherever possible, for example as "Dürst" in XML and HTML.) | Martin Duerst (Note: Please write "Duerst" with u-umlaut wherever | |||
| | possible, for example as "Dürst" in XML and HTML.) | |||
| World Wide Web Consortium | World Wide Web Consortium | |||
| 200 Technology Square | 200 Technology Square | |||
| Cambridge, MA 02139 | Cambridge, MA 02139 | |||
| U.S.A. | U.S.A. | |||
| Phone: +1 617 253 5509 | Phone: +1 617 253 5509 | |||
| Fax: +1 617 258 5999 | Fax: +1 617 258 5999 | |||
| EMail: duerst@w3.org | EMail: duerst@w3.org | |||
| URI: http://www.w3.org/People/D%C3%BCrst/ | URI: http://www.w3.org/People/D%C3%BCrst/ | |||
| (Note: This is the escaped form of an IRI.) | (Note: This is the escaped form of an IRI.) | |||
| skipping to change at page 32, line 7 | skipping to change at page 34, line 7 | |||
| One Microsoft Way | One Microsoft Way | |||
| Redmond, WA 98052 | Redmond, WA 98052 | |||
| U.S.A. | U.S.A. | |||
| Phone: +1 425 882-8080 | Phone: +1 425 882-8080 | |||
| EMail: mailto:michelsu@microsoft.com | EMail: mailto:michelsu@microsoft.com | |||
| URI: http://www.suignard.com | URI: http://www.suignard.com | |||
| Full Copyright Statement | Full Copyright Statement | |||
| Copyright (C) The Internet Society (2002). All Rights Reserved. | Copyright (C) The Internet Society (2003). All Rights Reserved. | |||
| This document and translations of it may be copied and furnished to | This document and translations of it may be copied and furnished to | |||
| others, and derivative works that comment on or otherwise explain it | others, and derivative works that comment on or otherwise explain it | |||
| or assist in its implementation may be prepared, copied, published | or assist in its implementation may be prepared, copied, published | |||
| and distributed, in whole or in part, without restriction of any | and distributed, in whole or in part, without restriction of any | |||
| kind, provided that the above copyright notice and this paragraph are | kind, provided that the above copyright notice and this paragraph are | |||
| included on all such copies and derivative works. However, this | included on all such copies and derivative works. However, this | |||
| document itself may not be modified in any way, such as by removing | document itself may not be modified in any way, such as by removing | |||
| the copyright notice or references to the Internet Society or other | the copyright notice or references to the Internet Society or other | |||
| Internet organizations, except as needed for the purpose of | Internet organizations, except as needed for the purpose of | |||
| End of changes. | ||||