| draft-duerst-iri-04.txt | draft-duerst-iri-05.txt | |||
|---|---|---|---|---|
| Network Working Group M. Duerst | Network Working Group M. Duerst | |||
| Internet-Draft W3C | Internet-Draft W3C | |||
| Expires: December 28, 2003 M. Suignard | Expires: April 25, 2004 M. Suignard | |||
| Microsoft Corporation | Microsoft Corporation | |||
| June 29, 2003 | October 26, 2003 | |||
| Internationalized Resource Identifiers (IRIs) | Internationalized Resource Identifiers (IRIs) | |||
| draft-duerst-iri-04 | draft-duerst-iri-05 | |||
| Status of this Memo | Status of this Memo | |||
| This document is an Internet-Draft and is in full conformance with | This document is an Internet-Draft and is in full conformance with | |||
| all provisions of Section 10 of RFC2026. | all provisions of Section 10 of RFC2026. | |||
| Internet-Drafts are working documents of the Internet Engineering | Internet-Drafts are working documents of the Internet Engineering | |||
| Task Force (IETF), its areas, and its working groups. Note that | Task Force (IETF), its areas, and its working groups. Note that | |||
| other groups may also distribute working documents as Internet- | other groups may also distribute working documents as Internet- | |||
| Drafts. | Drafts. | |||
| skipping to change at page 1, line 33 | skipping to change at page 1, line 33 | |||
| and may be updated, replaced, or obsoleted by other documents at any | and may be updated, replaced, or obsoleted by other documents at any | |||
| time. It is inappropriate to use Internet-Drafts as reference | time. It is inappropriate to use Internet-Drafts as reference | |||
| material or to cite them other than as "work in progress." | material or to cite them other than as "work in progress." | |||
| The list of current Internet-Drafts can be accessed at http:// | The list of current Internet-Drafts can be accessed at http:// | |||
| www.ietf.org/ietf/1id-abstracts.txt. | www.ietf.org/ietf/1id-abstracts.txt. | |||
| The list of Internet-Draft Shadow Directories can be accessed at | The list of Internet-Draft Shadow Directories can be accessed at | |||
| http://www.ietf.org/shadow.html. | http://www.ietf.org/shadow.html. | |||
| This Internet-Draft will expire on December 28, 2003. | This Internet-Draft will expire on April 25, 2004. | |||
| Copyright Notice | Copyright Notice | |||
| Copyright (C) The Internet Society (2003). All Rights Reserved. | Copyright (C) The Internet Society (2003). All Rights Reserved. | |||
| Abstract | Abstract | |||
| This document defines a new protocol element, the Internationalized | This document defines a new protocol element, the Internationalized | |||
| Resource Identifier (IRI), as a complement to the URI [RFCYYYY]. An | Resource Identifier (IRI), as a complement to the URI [RFCYYYY]. An | |||
| IRI is a sequence of characters from the Universal Character Set | IRI is a sequence of characters from the Universal Character Set | |||
| skipping to change at page 2, line 31 | skipping to change at page 2, line 31 | |||
| 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 4 | 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 4 | |||
| 1.1 Overview and Motivation . . . . . . . . . . . . . . . . . . 4 | 1.1 Overview and Motivation . . . . . . . . . . . . . . . . . . 4 | |||
| 1.2 Applicability . . . . . . . . . . . . . . . . . . . . . . . 4 | 1.2 Applicability . . . . . . . . . . . . . . . . . . . . . . . 4 | |||
| 1.3 Definitions . . . . . . . . . . . . . . . . . . . . . . . . 5 | 1.3 Definitions . . . . . . . . . . . . . . . . . . . . . . . . 5 | |||
| 1.4 Notation . . . . . . . . . . . . . . . . . . . . . . . . . . 6 | 1.4 Notation . . . . . . . . . . . . . . . . . . . . . . . . . . 6 | |||
| 2. IRI Syntax . . . . . . . . . . . . . . . . . . . . . . . . . 6 | 2. IRI Syntax . . . . . . . . . . . . . . . . . . . . . . . . . 6 | |||
| 2.1 Summary of IRI Syntax . . . . . . . . . . . . . . . . . . . 7 | 2.1 Summary of IRI Syntax . . . . . . . . . . . . . . . . . . . 7 | |||
| 2.2 ABNF for IRI References and IRIs . . . . . . . . . . . . . . 7 | 2.2 ABNF for IRI References and IRIs . . . . . . . . . . . . . . 7 | |||
| 3. Relationship between IRIs and URIs . . . . . . . . . . . . . 10 | 3. Relationship between IRIs and URIs . . . . . . . . . . . . . 10 | |||
| 3.1 Mapping of IRIs to URIs . . . . . . . . . . . . . . . . . . 10 | 3.1 Mapping of IRIs to URIs . . . . . . . . . . . . . . . . . . 10 | |||
| 3.2 Converting URIs to IRIs . . . . . . . . . . . . . . . . . . 12 | 3.2 Converting URIs to IRIs . . . . . . . . . . . . . . . . . . 13 | |||
| 3.2.1 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . 14 | 3.2.1 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . 14 | |||
| 4. Bidirectional IRIs for Right-to-left Languages . . . . . . . 15 | 4. Bidirectional IRIs for Right-to-left Languages . . . . . . . 16 | |||
| 4.1 Logical Storage and Visual Presentation . . . . . . . . . . 16 | 4.1 Logical Storage and Visual Presentation . . . . . . . . . . 16 | |||
| 4.2 Bidi IRI Structure . . . . . . . . . . . . . . . . . . . . . 16 | 4.2 Bidi IRI Structure . . . . . . . . . . . . . . . . . . . . . 17 | |||
| 4.3 Input of Bidi IRIs . . . . . . . . . . . . . . . . . . . . . 17 | 4.3 Input of Bidi IRIs . . . . . . . . . . . . . . . . . . . . . 18 | |||
| 4.4 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . 18 | 4.4 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . 18 | |||
| 5. IRI Equivalence and Comparison . . . . . . . . . . . . . . . 19 | 5. IRI Equivalence and Comparison . . . . . . . . . . . . . . . 20 | |||
| 5.1 Simple String Comparison . . . . . . . . . . . . . . . . . . 20 | 5.1 Simple String Comparison . . . . . . . . . . . . . . . . . . 20 | |||
| 5.2 Conversion to URIs . . . . . . . . . . . . . . . . . . . . . 20 | 5.2 Conversion to URIs . . . . . . . . . . . . . . . . . . . . . 21 | |||
| 5.3 Normalization . . . . . . . . . . . . . . . . . . . . . . . 20 | 5.3 Normalization . . . . . . . . . . . . . . . . . . . . . . . 21 | |||
| 5.4 Preferred Forms . . . . . . . . . . . . . . . . . . . . . . 21 | 5.4 Preferred Forms . . . . . . . . . . . . . . . . . . . . . . 22 | |||
| 6. Use of IRIs . . . . . . . . . . . . . . . . . . . . . . . . 22 | 6. Use of IRIs . . . . . . . . . . . . . . . . . . . . . . . . 22 | |||
| 6.1 Limitations on UCS Characters Allowed in IRIs . . . . . . . 22 | 6.1 Limitations on UCS Characters Allowed in IRIs . . . . . . . 23 | |||
| 6.2 Software Interfaces and Protocols . . . . . . . . . . . . . 22 | 6.2 Software Interfaces and Protocols . . . . . . . . . . . . . 23 | |||
| 6.3 Format of URIs and IRIs in Documents and Protocols . . . . . 23 | 6.3 Format of URIs and IRIs in Documents and Protocols . . . . . 23 | |||
| 6.4 Use of UTF-8 for Encoding Original Characters . . . . . . . 23 | 6.4 Use of UTF-8 for Encoding Original Characters . . . . . . . 24 | |||
| 6.5 Relative IRI References . . . . . . . . . . . . . . . . . . 24 | 6.5 Relative IRI References . . . . . . . . . . . . . . . . . . 25 | |||
| 7. URI/IRI Processing Guidelines (informative) . . . . . . . . 24 | 7. URI/IRI Processing Guidelines (informative) . . . . . . . . 25 | |||
| 7.1 URI/IRI Software Interfaces . . . . . . . . . . . . . . . . 24 | 7.1 URI/IRI Software Interfaces . . . . . . . . . . . . . . . . 25 | |||
| 7.2 URI/IRI Entry . . . . . . . . . . . . . . . . . . . . . . . 25 | 7.2 URI/IRI Entry . . . . . . . . . . . . . . . . . . . . . . . 26 | |||
| 7.3 URI/IRI Transfer Between Applications . . . . . . . . . . . 26 | 7.3 URI/IRI Transfer Between Applications . . . . . . . . . . . 26 | |||
| 7.4 URI/IRI Generation . . . . . . . . . . . . . . . . . . . . . 26 | 7.4 URI/IRI Generation . . . . . . . . . . . . . . . . . . . . . 27 | |||
| 7.5 URI/IRI Selection . . . . . . . . . . . . . . . . . . . . . 27 | 7.5 URI/IRI Selection . . . . . . . . . . . . . . . . . . . . . 27 | |||
| 7.6 Display of URIs/IRIs . . . . . . . . . . . . . . . . . . . . 27 | 7.6 Display of URIs/IRIs . . . . . . . . . . . . . . . . . . . . 28 | |||
| 7.7 Interpretation of URIs and IRIs . . . . . . . . . . . . . . 28 | 7.7 Interpretation of URIs and IRIs . . . . . . . . . . . . . . 28 | |||
| 7.8 Upgrading Strategy . . . . . . . . . . . . . . . . . . . . . 28 | 7.8 Upgrading Strategy . . . . . . . . . . . . . . . . . . . . . 29 | |||
| 8. Security Considerations . . . . . . . . . . . . . . . . . . 29 | 8. Security Considerations . . . . . . . . . . . . . . . . . . 30 | |||
| 9. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . 30 | 9. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . 31 | |||
| Normative References . . . . . . . . . . . . . . . . . . . . 31 | Normative References . . . . . . . . . . . . . . . . . . . . 32 | |||
| Non-normative References . . . . . . . . . . . . . . . . . . 32 | Non-normative References . . . . . . . . . . . . . . . . . . 32 | |||
| Authors' Addresses . . . . . . . . . . . . . . . . . . . . . 34 | Authors' Addresses . . . . . . . . . . . . . . . . . . . . . 35 | |||
| Full Copyright Statement . . . . . . . . . . . . . . . . . . 35 | Full Copyright Statement . . . . . . . . . . . . . . . . . . 36 | |||
| 1. Introduction | 1. Introduction | |||
| 1.1 Overview and Motivation | 1.1 Overview and Motivation | |||
| A URI is defined in [RFCYYYY] as a sequence of characters chosen from | A URI is defined in [RFCYYYY] as a sequence of characters chosen from | |||
| a limited subset of the repertoire of US-ASCII characters. | a limited subset of the repertoire of US-ASCII characters. | |||
| The characters in URIs are frequently used for representing words of | The characters in URIs are frequently used for representing words of | |||
| natural languages. Such usage has many advantages: such URIs are | natural languages. Such usage has many advantages: such URIs are | |||
| skipping to change at page 4, line 44 | skipping to change at page 4, line 44 | |||
| [RFCYYYY], such as URI references. | [RFCYYYY], such as URI references. | |||
| Using characters outside of A-Z in IRIs brings with it some | Using characters outside of A-Z in IRIs brings with it some | |||
| difficulties; a discussion of potential problems and workarounds can | difficulties; a discussion of potential problems and workarounds can | |||
| be found in the later sections of this document. | be found in the later sections of this document. | |||
| 1.2 Applicability | 1.2 Applicability | |||
| IRIs are designed to be compatible with recent recommendations for | IRIs are designed to be compatible with recent recommendations for | |||
| new URI schemes [RFC2718]. The compatibility is provided by | new URI schemes [RFC2718]. The compatibility is provided by | |||
| providing a well defined and deterministic mapping from the IRI | specifying a well defined and deterministic mapping from the IRI | |||
| character sequence to the functionally equivalent URI character | character sequence to the functionally equivalent URI character | |||
| sequence. Practical use of IRIs (or IRI references) in place of URIs | sequence. Practical use of IRIs (or IRI references) in place of URIs | |||
| (or URI references) depends on the following conditions being met: | (or URI references) depends on the following conditions being met: | |||
| a) The protocol or format element used should be explicitly | a) The protocol or format element used should be explicitly | |||
| designated to carry IRIs. That is, the intent is not to | designated to carry IRIs. That is, the intent is not to | |||
| introduce IRIs into contexts that are not defined to accept | introduce IRIs into contexts that are not defined to accept | |||
| them. For example, XML schema [XMLSchema] has an explicit type | them. For example, XML schema [XMLSchema] has an explicit type | |||
| "anyURI" that designates the use of IRIs. | "anyURI" that designates the use of IRIs. | |||
| skipping to change at page 5, line 17 | skipping to change at page 5, line 17 | |||
| mechanism to represent the wide range of characters used in | mechanism to represent the wide range of characters used in | |||
| IRIs, either natively or by some protocol- or format-specific | IRIs, either natively or by some protocol- or format-specific | |||
| escaping mechanism (for example numeric character references in | escaping mechanism (for example numeric character references in | |||
| [XML1]). | [XML1]). | |||
| c) The URI corresponding to the IRI in question has to encode | c) The URI corresponding to the IRI in question has to encode | |||
| original characters into octets using UTF-8. For new URI | original characters into octets using UTF-8. For new URI | |||
| schemes, this is recommended in [RFC2718]. It can apply to a | schemes, this is recommended in [RFC2718]. It can apply to a | |||
| whole scheme (e.g. IMAP URLs [RFC2192] and POP URLs [RFC2384], | whole scheme (e.g. IMAP URLs [RFC2192] and POP URLs [RFC2384], | |||
| or the URN syntax [RFC2141]). It can apply to a specific part | or the URN syntax [RFC2141]). It can apply to a specific part | |||
| of an URI, such as the fragment identifier (e.g. [XPointer]). | of a URI, such as the fragment identifier (e.g. [XPointer]). | |||
| It can apply to a specific URI or part(s) thereoff. For | It can apply to a specific URI or part(s) thereof. For | |||
| details, please see Section 6.4. | details, please see Section 6.4. | |||
| 1.3 Definitions | 1.3 Definitions | |||
| The following definitions are used in this document; they follow the | The following definitions are used in this document; they follow the | |||
| terms in [RFC2130], [RFC2277] and [ISO10646]: | terms in [RFC2130], [RFC2277] and [ISO10646]: | |||
| character: A member of a set of elements used for the | character: A member of a set of elements used for the | |||
| organization, control, or representation of data. For example, | organization, control, or representation of data. For example, | |||
| "LATIN CAPITAL LETTER A" names a character. | "LATIN CAPITAL LETTER A" names a character. | |||
| skipping to change at page 5, line 45 | skipping to change at page 5, line 45 | |||
| sequence of characters: A sequence (one after another) of | sequence of characters: A sequence (one after another) of | |||
| characters | characters | |||
| sequence of octets: A sequence (one after another) of octets | sequence of octets: A sequence (one after another) of octets | |||
| (character) encoding: A method of representing a sequence of | (character) encoding: A method of representing a sequence of | |||
| characters as a sequence of octets (maybe with variants). A | characters as a sequence of octets (maybe with variants). A | |||
| method of (unambiguously) converting a sequence of octets into | method of (unambiguously) converting a sequence of octets into | |||
| a sequence of characters. | a sequence of characters. | |||
| code point: A placeholder for a character in a character encoding, | ||||
| for example to encode additional characters in future versions | ||||
| of the character encoding. | ||||
| charset: The name of a parameter or attribute used to identify a | charset: The name of a parameter or attribute used to identify a | |||
| character encoding. | character encoding. | |||
| UCS: Universal Character Set; the coded character set defined by | UCS: Universal Character Set; the coded character set defined by | |||
| [ISO10646] and [UNIV4]. | [ISO10646] and [UNIV4]. | |||
| IRI reference: The term "IRI reference" denotes the common usage | IRI reference: The term "IRI reference" denotes the common usage | |||
| of an internationalized resource identifier. An IRI reference | of an internationalized resource identifier. An IRI reference | |||
| may be absolute or relative. However, the "IRI" that results | may be absolute or relative. However, the "IRI" that results | |||
| from such a reference only includes absolute IRIs; any relative | from such a reference only includes absolute IRIs; any relative | |||
| IRIs are resolved to their absolute form. Note that in | IRIs are resolved to their absolute form. Note that in | |||
| [RFC2396], URIs did not include fragment identifiers, but in | [RFC2396], URIs did not include fragment identifiers, but in | |||
| [RFCYYYY], fragment identifiers are part of URIs. | [RFCYYYY], fragment identifiers are part of URIs. | |||
| running text: Human text (paragraphs, sentences, phrases) with | ||||
| syntax according to orthographic conventions of a natural | ||||
| language, as opposed to syntax defined for ease of processing | ||||
| by machines (markup, programming languages,...). | ||||
| 1.4 Notation | 1.4 Notation | |||
| RFCs and Internet Drafts currently do not allow any characters | RFCs and Internet Drafts currently do not allow any characters | |||
| outside the US-ASCII repertoire. Therefore, this document uses | outside the US-ASCII repertoire. Therefore, this document uses | |||
| various special notations to denote such characters in examples. | various special notations to denote such characters in examples. | |||
| In text, characters outside US-ASCII are sometimes referenced by | In text, characters outside US-ASCII are sometimes referenced by | |||
| using a prefix of 'U+', followed by four to six hexadecimal digits. | using a prefix of 'U+', followed by four to six hexadecimal digits. | |||
| To represent characters outside US-ASCII in examples, this document | To represent characters outside US-ASCII in examples, this document | |||
| skipping to change at page 6, line 38 | skipping to change at page 6, line 40 | |||
| XML Notation uses leading '&#x', trailing ';', and the hexadecimal | XML Notation uses leading '&#x', trailing ';', and the hexadecimal | |||
| number of the character in the UCS in between. Example: &#x44F; | number of the character in the UCS in between. Example: &#x44F; | |||
| stands for CYRILLIC CAPITAL LETTER YA. In this notation, an actual | stands for CYRILLIC CAPITAL LETTER YA. In this notation, an actual | |||
| '&' is denoted by '&amp;'. | '&' is denoted by '&amp;'. | |||
| Bidi Notation is used for bidirectional examples: lower case ASCII | Bidi Notation is used for bidirectional examples: lower case ASCII | |||
| letters stand for Latin letters or other letters that are written | letters stand for Latin letters or other letters that are written | |||
| left-to-right, whereas upper case letters represent Arabic or Hebrew | left-to-right, whereas upper case letters represent Arabic or Hebrew | |||
| letters that are written right-to-left. | letters that are written right-to-left. | |||
| To denote actual octets in examples (as opposed to escaped octets), | ||||
| the two hex digits denoting the octet are enclosed in "<" and ">". | ||||
| For example, the octet often denoted as 0xc9 is denoted here as <c9>. | ||||
| The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", | The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", | |||
| "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this | "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this | |||
| document are to be interpreted as described in [RFC2119]. | document are to be interpreted as described in [RFC2119]. | |||
| 2. IRI Syntax | 2. IRI Syntax | |||
| This section defines the syntax of Internationalized Resource | This section defines the syntax of Internationalized Resource | |||
| Identifiers (IRIs). | Identifiers (IRIs). | |||
| As with URIs, an IRI is defined as a sequence of characters, not as a | As with URIs, an IRI is defined as a sequence of characters, not as a | |||
| skipping to change at page 7, line 14 | skipping to change at page 7, line 20 | |||
| these protocols or documents use different character encodings (and/ | these protocols or documents use different character encodings (and/ | |||
| or transfer encodings). Using the same character encoding as the | or transfer encodings). Using the same character encoding as the | |||
| containing protocol or document assures that the characters in the | containing protocol or document assures that the characters in the | |||
| IRI can be handled (searched, converted, displayed,...) in the same | IRI can be handled (searched, converted, displayed,...) in the same | |||
| way as the rest of the protocol or document. | way as the rest of the protocol or document. | |||
| 2.1 Summary of IRI Syntax | 2.1 Summary of IRI Syntax | |||
| IRIs are defined similarly to URIs in [RFCYYYY], but the class of | IRIs are defined similarly to URIs in [RFCYYYY], but the class of | |||
| unreserved characters is extended by adding the characters of the UCS | unreserved characters is extended by adding the characters of the UCS | |||
| (Universal Character Set, [ISO10646]) beyond U+0080, subject to the | (Universal Character Set, [ISO10646]) beyond U+007F, subject to the | |||
| limitations given in the syntax rules below and in Section 6.1. | limitations given in the syntax rules below and in Section 6.1. | |||
| Otherwise, the syntax and use of components and reserved characters | Otherwise, the syntax and use of components and reserved characters | |||
| is the same as that in [RFCYYYY]. All the operations defined in | is the same as that in [RFCYYYY]. All the operations defined in | |||
| [RFCYYYY], such as the resolution of relative URIs, can be applied to | [RFCYYYY], such as the resolution of relative URIs, can be applied to | |||
| IRIs by IRI-processing software in exactly the same way as this is | IRIs by IRI-processing software in exactly the same way as this is | |||
| done to URIs by URI-processing software. | done to URIs by URI-processing software. | |||
| Characters outside the US-ASCII range are not reserved and therefore | Characters outside the US-ASCII range are not reserved and therefore | |||
| MUST NOT be used for syntactical purposes such as to delimit | MUST NOT be used for syntactical purposes such as to delimit | |||
| skipping to change at page 8, line 25 | skipping to change at page 8, line 29 | |||
| ":" / "&" / "=" / "+" / "$" / "," ) | ":" / "&" / "=" / "+" / "$" / "," ) | |||
| ihost = [ IPv6reference / IPv4address / ihostname ] | ihost = [ IPv6reference / IPv4address / ihostname ] | |||
| ihostname = idomainlabel iqualified | ihostname = idomainlabel iqualified | |||
| iqualified = *( "." idomainlabel ) [ "." ] | iqualified = *( "." idomainlabel ) [ "." ] | |||
| idomainlabel = <<See following production rules>> | idomainlabel = <<See following production rules>> | |||
| ipath-segments = isegment *( "/" isegment ) | ipath-segments = ipath-segment *( "/" ipath-segment ) | |||
| isegment = *ipchar | ipath-segment = *ipchar | |||
| ipchar = iunreserved / escaped / ";" / | ipchar = iunreserved / escaped / ";" / | |||
| ":" / "@" / "&" / "=" / "+" / "$" / "," | ":" / "@" / "&" / "=" / "+" / "$" / "," | |||
| iquery = *( ipchar / iprivate / "/" / "?" ) | iquery = *( ipchar / iprivate / "/" / "?" ) | |||
| ifragment = *( ipchar / "/" / "?" ) | ifragment = *( ipchar / "/" / "?" ) | |||
| iric = reserved / iunreserved / escaped | iric = reserved / iunreserved / escaped | |||
| skipping to change at page 8, line 48 | skipping to change at page 9, line 4 | |||
| iunreserved = unreserved / ucschar | iunreserved = unreserved / ucschar | |||
| ucschar = %xA0-D7FF / %xF900-FDCF / %xFDF0-FFEF / | ucschar = %xA0-D7FF / %xF900-FDCF / %xFDF0-FFEF / | |||
| / %x10000-1FFFD / %x20000-2FFFD / %x30000-3FFFD | / %x10000-1FFFD / %x20000-2FFFD / %x30000-3FFFD | |||
| / %x40000-4FFFD / %x50000-5FFFD / %x60000-6FFFD | / %x40000-4FFFD / %x50000-5FFFD / %x60000-6FFFD | |||
| / %x70000-7FFFD / %x80000-8FFFD / %x90000-9FFFD | / %x70000-7FFFD / %x80000-8FFFD / %x90000-9FFFD | |||
| / %xA0000-AFFFD / %xB0000-BFFFD / %xC0000-CFFFD | / %xA0000-AFFFD / %xB0000-BFFFD / %xC0000-CFFFD | |||
| / %xD0000-DFFFD / %xE1000-EFFFD | / %xD0000-DFFFD / %xE1000-EFFFD | |||
| iprivate = %xE000-F8FF / %xF0000-FFFFD / %x100000-10FFFD | iprivate = %xE000-F8FF / %xF0000-FFFFD / %x100000-10FFFD | |||
| The 'idomainlabel' production rule is as follows: | The 'idomainlabel' production rule is as follows: | |||
| The value 'idomainlabel' is defined as a string of 'ucschar' obeying | The value 'idomainlabel' is defined as a string of 'ucschar' obeying | |||
| the following rules: | the following rules: | |||
| a) Given a string of 'ucschar' values, the ToASCII operation | a) Given a string of 'ucschar' values, the ToASCII operation | |||
| [RFC3490] is performed on that string with the flag | [RFC3490] is performed on that string with the flag | |||
| UseSTD3ASCIIRules set to TRUE and the flag AllowUnassigned set | UseSTD3ASCIIRules set to TRUE and the flag AllowUnassigned set | |||
| to FALSE for creating IRIs and set to TRUE otherwise. | to FALSE for creating IRIs and set to TRUE otherwise. | |||
| b) ToASCII is successful and results in a string conforming to | b) ToASCII is successful. (Note: This means that its output | |||
| 'domainlabel' (see below). | conforms to 'domainlabel' as defined below.) | |||
| The following are the same as [RFCYYYY]: | The following are the same as [RFCYYYY]: | |||
| scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." ) | scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." ) | |||
| port = *DIGIT | port = *DIGIT | |||
| domainlabel = alphanum [ 0*61( alphanum \| "-" ) alphanum ] | domainlabel = alphanum [ 0*61( alphanum \| "-" ) alphanum ] | |||
| alphanum = ALPHA / DIGIT | alphanum = ALPHA / DIGIT | |||
| skipping to change at page 10, line 38 | skipping to change at page 10, line 41 | |||
| a) Syntactical: Many URI schemes and components define additional | a) Syntactical: Many URI schemes and components define additional | |||
| syntactical restrictions not captured in Section 2.2. Such | syntactical restrictions not captured in Section 2.2. Such | |||
| restrictions can be applied to IRIs by noting that IRIs are | restrictions can be applied to IRIs by noting that IRIs are | |||
| only valid if they map to syntactically valid URIs. This means | only valid if they map to syntactically valid URIs. This means | |||
| that such syntactical restrictions do not have to be defined | that such syntactical restrictions do not have to be defined | |||
| again on the IRI level. | again on the IRI level. | |||
| b) Interpretational: URIs identify resources in various ways. | b) Interpretational: URIs identify resources in various ways. | |||
| IRIs also identify resources. When the IRI is used solely for | IRIs also identify resources. When the IRI is used solely for | |||
| identification purposes, it is not necessary to map the IRI to | identification purposes, it is not necessary to map the IRI to | |||
| an URI (see Section 5). However, when an IRI is used for | a URI (see Section 5). However, when an IRI is used for | |||
| resource retrieval, the resource that the IRI locates is the | resource retrieval, the resource that the IRI locates is the | |||
| same as the one located by the URI obtained after converting | same as the one located by the URI obtained after converting | |||
| the IRI according to the procedure defined here. This means | the IRI according to the procedure defined here. This means | |||
| that there is no need to define resolution separately on the | that there is no need to define resolution separately on the | |||
| IRI level. | IRI level. | |||
| Applications MUST map IRIs to URIs using the following two steps. | Applications MUST map IRIs to URIs using the following two steps. | |||
| Step 1) This step generates a UCS-based encoding from the original | Step 1) This step generates a UCS-based encoding from the original | |||
| IRI format. This step has three variants, depending on the | IRI format. This step has three variants, depending on the | |||
| skipping to change at page 11, line 24 | skipping to change at page 11, line 26 | |||
| from the UCS normalized according to NFC. | from the UCS normalized according to NFC. | |||
| Variant C) If the IRI is in an Unicode-based encoding (for | Variant C) If the IRI is in an Unicode-based encoding (for | |||
| example UTF-8 or UTF-16): Do not normalize. Move | example UTF-8 or UTF-16): Do not normalize. Move | |||
| directly to Step 2. | directly to Step 2. | |||
| Step 2) If the IRI contains an 'ihostname' part, replace this | Step 2) If the IRI contains an 'ihostname' part, replace this | |||
| 'ihostname' part by the part converted using the ToASCII | 'ihostname' part by the part converted using the ToASCII | |||
| operation specified in Section 4.1 of [RFC3490], with the flag | operation specified in Section 4.1 of [RFC3490], with the flag | |||
| UseSTD3ASCIIRules set to TRUE and the flag AllowUnassigned set | UseSTD3ASCIIRules set to TRUE and the flag AllowUnassigned set | |||
| to FALSE for creating IRIs and set to TRUE otherwise. | to FALSE for creating IRIs and set to TRUE otherwise. The | |||
| ToASCII operation may fail, but only if the IRI does not | ||||
| conform to the rules in Section 2.2. | ||||
| Step 3) For each character that is disallowed in URI references, | Step 3) For each character that is disallowed in URI references, | |||
| apply steps 1) through 3) below. The disallowed characters | apply steps 1) through 3) below. The disallowed characters | |||
| consist of all non-ASCII characters allowed in IRIs. | consist of all non-ASCII characters allowed in IRIs. | |||
| 1) Convert the character to a sequence of one or more octets | 1) Convert the character to a sequence of one or more octets | |||
| using UTF-8 [RFCXXXX]. | using UTF-8 [RFCXXXX]. | |||
| 2) Convert each octet to %HH, where HH is the hexadecimal | 2) Convert each octet to %HH, where HH is the hexadecimal | |||
| notation of the octet value. Note: This is identical to | notation of the octet value. Note: This is identical to | |||
| the escaping mechanism in Section 2.4.1 of [RFCYYYY]. | the escaping mechanism in Section 2.4.1 of [RFCYYYY]. To | |||
| Note: To reduce variability, the hexadecimal notation | reduce variability, the hexadecimal notation SHOULD use | |||
| SHOULD use upper case letters. | upper case letters. | |||
| 3) Replace the original character by the resulting character | 3) Replace the original character by the resulting character | |||
| sequence (i.e. a sequence of %HH triplets). | sequence (i.e. a sequence of %HH triplets). | |||
| Note that the ToASCII operation in Step 2) may fail, but only if the | The above mapping from IRIs to URIs produces URIs fully conforming to | |||
| IRI does not conform to the rules in Section 2.2. | [RFCYYYY]. The mapping is also an identity transformation for URIs | |||
| and is idempotent -- applying the mapping a second time will not | ||||
| change anything. Every URI is by definition an IRI. | ||||
| Note: For backwards compatibility with implementations of previous | Infrastructure accepting IRIs MAY also deal with 'ihostname' parts | |||
| drafts of this specification, infrastructure accepting IRIs MAY also | escaped according to Step 3) rather than Step 2). For example, Step | |||
| deal with 'ihostname' parts escaped according to Step 3) rather than | 2) converts the IRI | |||
| Step 2). For example, Step 2) converts the IRI | ||||
| http://r&#xE9;sum&#xE9;.example.org to | http://r&#xE9;sum&#xE9;.example.org to | |||
| http://xn--rsum-bpad.example.org. For backwards compatibility, | http://xn--rsum-bpad.example.org. For backward compatibility, | |||
| http://r%C3%A9sum%C3%A9.example.org would also be converted to | http://r%C3%A9sum%C3%A9.example.org would also be converted to | |||
| http://xn--rsum-bpad.example.org. | http://xn--rsum-bpad.example.org. | |||
| Note that Internationalized Domain Names may be contained in parts of | Infrastructure accepting IRIs MAY also deal with the printable | |||
| an IRI other than the 'ihostname' part. | characters in US-ASCII that are not allowed in URIs, namely "<", ">", | |||
| '"', Space, "{", "}", "|", "\", "^", and "`", in step 3) above. If | ||||
| such characters are found but are not converted, then the conversion | ||||
| SHOULD fail. Please note that the number sign ("#"), the percent | ||||
| sign ("%"), and the square bracket characters ("[", "]") are not part | ||||
| of the above list, and MUST NOT be converted. Protocols and formats | ||||
| that have used earlier definitions of IRIs including these characters | ||||
| MAY require unescaping of these characters as a preprocessing step to | ||||
| extract the actual IRI from a given field. Such preprocessing MAY | ||||
| also be used by applications allowing the user to enter an IRI. | ||||
| Note that in this process (in step 3.3), characters allowed in URI | Internationalized Domain Names may be contained in parts of an | |||
| IRI other than the 'ihostname' part. In this case, Step 2) is | ||||
| not used, but Step 3) is applied. This is important to | ||||
| maintain uniform treatment of URIs. See [Gettys] for an in- | ||||
| depth discussion. It is the responsibility of scheme-specific | ||||
| implementations (if the Internationalized Domain Name is part | ||||
| of the scheme syntax) or of server-side implementations (if the | ||||
| Internationalized Domain Name is part of 'iquery') to apply the | ||||
| necessary conversions at the appropriate point. Example: | ||||
| Trying to validate the Web page at | ||||
| http://résumé.example.org would lead to an IRI of | ||||
| http://validator.w3.org/ | ||||
| check?uri=http%3A%2F%2Frésumé.example.org, which | ||||
| would convert to a URI of | ||||
| http://validator.w3.org/ | ||||
| check?uri=http%3A%2F%2Fr%C3%A9sum%C3%A9.example.org. The | ||||
| server side implementation would be responsible to do the | ||||
| necessary conversions in order to be able to retrieve the Web | ||||
| page. | ||||
| In this process (in step 3.3), characters allowed in URI | ||||
| references as well as existing escape sequences are not escaped | references as well as existing escape sequences are not escaped | |||
| further. (This mapping is similar to, but different from, the | further. (This mapping is similar to, but different from, the | |||
| escaping applied when including arbitrary content into some part of a | escaping applied when including arbitrary content into some | |||
| URI.) For example, an IRI of | part of a URI.) For example, an IRI of | |||
| http://www.example.org/red%09rosé#red (in XML notation) is | http://www.example.org/red%09rosé#red (in XML notation) is | |||
| converted to | converted to | |||
| http://www.example.org/red%09ros%C3%A9#red, not to something like | http://www.example.org/red%09ros%C3%A9#red, not to something | |||
| like | ||||
| http%3A%2F%2Fwww.example.org%2Fred%2509ros%C3%A9%23red. | http%3A%2F%2Fwww.example.org%2Fred%2509ros%C3%A9%23red. | |||
| Note that some older software transcoding to UTF-8 may produce | Some older software transcoding to UTF-8 may produce illegal | |||
| illegal output for some input, in particular for characters outside | output for some input, in particular for characters outside the | |||
| the BMP (Basic Multilingual Plane). As an example, for the following | BMP (Basic Multilingual Plane). As an example, for the | |||
| IRI with non-BMP characters (in XML Notation): | following IRI with non-BMP characters (in XML Notation): | |||
| http://example.com/𐌀𐌁𐌁 | http://example.com/𐌀𐌁𐌂 | |||
| (the first three letters of the Old Italic alphabet) the correct | (the first three letters of the Old Italic alphabet) the | |||
| conversion to a URI is: | correct conversion to a URI is: | |||
| http://example.com/%F0%90%8C%80%F0%90%8C%81%F0%90%8C%82 | http://example.com/%F0%90%8C%80%F0%90%8C%81%F0%90%8C%82 | |||
| The above mapping produces a URI fully conforming to [RFCYYYY] out of | ||||
| each IRI. The mapping is also an identity transformation for URIs | ||||
| and is idempotent -- applying the mapping a second time will not | ||||
| change anything. Every URI is therefore by definition an IRI. | ||||
| Note: Earlier drafts of this specification allowed the space | ||||
| character and various delimiters in IRIs and IRI references. The | ||||
| full list of these characters was: "<", ">", '"', Space, "{", "}", | ||||
| "|", "\", "^", and "`", i.e. all printable characters in US-ASCII | ||||
| that are not allowed in URIs. For backwards compatibility, | ||||
| implementations MAY also include these characters in step 3) above. | ||||
| If such characters are found but are not converted, then the | ||||
| conversion SHOULD fail. Please note that the number sign ("#"), the | ||||
| percent sign ("%"), and the square bracket characters ("[", "]") are | ||||
| not part of the above list, and MUST not be converted. Protocols and | ||||
| formats that have used earlier definitions of IRIs including these | ||||
| characters MAY require unescaping of these characters as a | ||||
| preprocessing step to extract the actual IRI from a given field. | ||||
| Such preprocessing MAY also be used by applications allowing the user | ||||
| to enter an IRI. | ||||
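The escaping in Step 3) above can be sketched in a few lines of Python. This is a minimal illustration, not a full implementation of the mapping: the function name `escape_non_ascii` is hypothetical, Step 2) (ToASCII on the 'ihostname' part) is deliberately left out, and the optional handling of US-ASCII characters such as Space or "<" is not attempted. Each non-ASCII character is encoded as UTF-8 and every resulting octet is written as an uppercase %HH triplet; ASCII characters, including "%" and "#", pass through unchanged.

```python
def escape_non_ascii(iri: str) -> str:
    """Percent-escape every non-ASCII character of an IRI (Step 3 only).

    Assumption: the input is a Python str in logical order. ASCII
    characters, including existing %HH escapes, are copied as-is.
    """
    out = []
    for ch in iri:
        if ord(ch) < 0x80:
            # Characters already allowed in URIs and existing escapes
            # are not escaped further.
            out.append(ch)
        else:
            # Encode as UTF-8, then write each octet as %HH using
            # uppercase hexadecimal digits.
            out.extend(f"%{octet:02X}" for octet in ch.encode("utf-8"))
    return "".join(out)


# The two examples from the text above.
rose = escape_non_ascii("http://www.example.org/red%09ros\u00e9#red")
old_italic = escape_non_ascii(
    "http://example.com/\U00010300\U00010301\U00010302"
)
```

Because the output contains only ASCII characters, passing it through the function a second time returns it unchanged, which matches the idempotence property stated for the mapping.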
| 3.2 Converting URIs to IRIs | 3.2 Converting URIs to IRIs | |||
| In some situations, it may be desirable to try to convert a URI into | In some situations, it may be desirable to try to convert a URI into | |||
| an equivalent IRI. This section gives a procedure to do such a | an equivalent IRI. This section gives a procedure to do such a | |||
| conversion. The conversion described in this section will always | conversion. The conversion described in this section will always | |||
| result in an IRI which maps back to the URI that was used as an input | result in an IRI which maps back to the URI that was used as an input | |||
| for the conversion (except for potential case differences in escape | for the conversion (except for potential case differences in escape | |||
| sequences). However, the IRI resulting from this conversion may not | sequences). However, the IRI resulting from this conversion may not | |||
| be exactly the same as the original IRI (if there ever was one). | be exactly the same as the original IRI (if there ever was one). | |||
| skipping to change at page 13, line 32 | skipping to change at page 13, line 48 | |||
| discussion, see [Duerst97].) | discussion, see [Duerst97].) | |||
| c) The conversion may result in a character that is not | c) The conversion may result in a character that is not | |||
| appropriate in an IRI. See Section 6.1 for further details. | appropriate in an IRI. See Section 6.1 for further details. | |||
| Conversion from a URI to an IRI is done using the following steps (or | Conversion from a URI to an IRI is done using the following steps (or | |||
| any other algorithm that produces the same result): | any other algorithm that produces the same result): | |||
| 1) Represent the URI as a sequence of octets in US-ASCII. | 1) Represent the URI as a sequence of octets in US-ASCII. | |||
| 2) Replace any punycode-encoded domainlabel in the URI by the | 2) Apply the ToUnicode operation to each 'domainlabel' in the | |||
| result of the ToUnicode function represented as UTF-8. | 'hostname' part (if there is one), representing the output as | |||
| UTF-8. | ||||
| 3) Convert all hexadecimal escapes (% followed by two hexadecimal | 3) Convert all hexadecimal escapes (% followed by two hexadecimal | |||
| digits) except those corresponding to '%', characters in | digits) except those corresponding to '%', characters in | |||
| 'reserved', and characters in US-ASCII not allowed in URIs, to | 'reserved', and characters in US-ASCII not allowed in URIs, to | |||
| the corresponding octets. | the corresponding octets. | |||
| 4) Re-escape any octet produced in step 3) that is not part of a | 4) Re-escape any octet produced in step 3) that is not part of a | |||
| strictly legal UTF-8 octet sequence. | strictly legal UTF-8 octet sequence. | |||
| 5) Re-escape all octets produced in step 3) that in UTF-8 | 5) Re-escape all octets produced in step 3) that in UTF-8 | |||
| skipping to change at page 14, line 34 | skipping to change at page 15, line 4 | |||
| The following example contains the sequence '%C3%BC', which is a | The following example contains the sequence '%C3%BC', which is a | |||
| strictly legal UTF-8 sequence, and which is converted into the actual | strictly legal UTF-8 sequence, and which is converted into the actual | |||
| character U+00FC LATIN SMALL LETTER U WITH DIAERESIS (also known as | character U+00FC LATIN SMALL LETTER U WITH DIAERESIS (also known as | |||
| u-umlaut). | u-umlaut). | |||
| 1) http://www.example.org/D%C3%BCrst | 1) http://www.example.org/D%C3%BCrst | |||
| 2) http://www.example.org/D%C3%BCrst | 2) http://www.example.org/D%C3%BCrst | |||
| 3) http://www.example.org/D<c3><bc>rst | 3) http://www.example.org/D<c3><bc>rst | |||
| 4) http://www.example.org/D<c3><bc>rst | 4) http://www.example.org/D<c3><bc>rst | |||
| 5) http://www.example.org/D<c3><bc>rst | 5) http://www.example.org/D<c3><bc>rst | |||
| 6) http://www.example.org/Dürst | 6) http://www.example.org/Dürst | |||
| The following example contains the sequence '%FC', which might | The following example contains the sequence '%FC', which might | |||
| represent U+00FC LATIN SMALL LETTER U WITH DIAERESIS in the | represent U+00FC LATIN SMALL LETTER U WITH DIAERESIS in the | |||
| iso-8859-1 encoding. (It might represent other characters in other | iso-8859-1 encoding. (It might represent other characters in other | |||
| encodings. For example, the octet <FC> in iso-8859-5 represents | encodings. For example, the octet <fc> in iso-8859-5 represents | |||
| U+045C CYRILLIC SMALL LETTER KJE.) Because <FC> is not part of a | U+045C CYRILLIC SMALL LETTER KJE.) Because <fc> is not part of a | |||
| strictly legal UTF-8 sequence, it is re-escaped in step 4). | strictly legal UTF-8 sequence, it is re-escaped in step 4). | |||
| 1) http://www.example.org/D%FCrst | 1) http://www.example.org/D%FCrst | |||
| 2) http://www.example.org/D%FCrst | 2) http://www.example.org/D%FCrst | |||
| 3) http://www.example.org/D<FC>rst | 3) http://www.example.org/D<fc>rst | |||
| 4) http://www.example.org/D%FCrst | 4) http://www.example.org/D%FCrst | |||
| 5) http://www.example.org/D%FCrst | 5) http://www.example.org/D%FCrst | |||
| 6) http://www.example.org/D%FCrst | 6) http://www.example.org/D%FCrst | |||
| The following example contains '%e2%80%ae', which is the escaped | The following example contains '%e2%80%ae', which is the escaped | |||
| UTF-8 encoding of U+202E, RIGHT-TO-LEFT OVERRIDE. Section 4.1 | UTF-8 encoding of U+202E, RIGHT-TO-LEFT OVERRIDE. Section 4.1 | |||
| forbids the direct use of this character in an IRI. Therefore, the | forbids the direct use of this character in an IRI. Therefore, the | |||
| corresponding octets are re-escaped in step 5). This example shows | corresponding octets are re-escaped in step 5). This example shows | |||
| that the case (upper or lower) of letters used in escapes may not be | that the case (upper or lower) of letters used in escapes may not be | |||
| preserved. The example also contains a punycode-encoded domain name | preserved. The example also contains a punycode-encoded domain name | |||
| label (xn--99zt52a), which is converted to the corresponding | label (xn--99zt52a), which is converted to the corresponding | |||
| characters U+7D0D U+8C46 (Japanese Natto). | characters U+7D0D U+8C46 (Japanese Natto). | |||
| 1) http://xn--99zt52a.example.org/%e2%80%ae | 1) http://xn--99zt52a.example.org/%e2%80%ae | |||
| 2) http://<E7><B4><8D><E8><B1><86>.example.org/%e2%80%ae | 2) http://<e7><b4><8d><e8><b1><86>.example.org/%e2%80%ae | |||
| 3) http://<E7><B4><8D><E8><B1><86>.example.org/<E2><80><AE> | 3) http://<e7><b4><8d><e8><b1><86>.example.org/<e2><80><ae> | |||
| 4) http://<E7><B4><8D><E8><B1><86>.example.org/<E2><80><AE> | 4) http://<e7><b4><8d><e8><b1><86>.example.org/<e2><80><ae> | |||
| 5) http://<E7><B4><8D><E8><B1><86>.example.org/%E2%80%AE | 5) http://<e7><b4><8d><e8><b1><86>.example.org/%E2%80%AE | |||
| 6) http://納豆.example.org/%E2%80%AE | 6) http://納豆.example.org/%E2%80%AE | |||
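The three examples above can be reproduced with a short Python sketch of steps 3) to 6). It rests on several assumptions: the names `unescape_utf8`, `BIDI_FORMATTING`, and `ESCAPE_RUN` are hypothetical; step 2) (ToUnicode on the host name) is not implemented; every escaped US-ASCII octet is left escaped, which is stricter than step 3) requires; and the only characters re-escaped in step 5) are the bidirectional formatting characters listed in Section 4.1. Python's strict UTF-8 decoder with the `surrogateescape` error handler is used to detect octets that are not part of a legal UTF-8 sequence.

```python
import re

# Bidirectional formatting characters named in Section 4.1
# (LRM, RLM, LRE, RLE, PDF, LRO, RLO). A partial stand-in for the
# full set of characters that are not appropriate in an IRI.
BIDI_FORMATTING = {0x200E, 0x200F, 0x202A, 0x202B, 0x202C, 0x202D, 0x202E}

# A run of one or more consecutive %HH escapes.
ESCAPE_RUN = re.compile(r"(?:%[0-9A-Fa-f]{2})+")


def _escape(octets: bytes) -> str:
    """Write octets as uppercase %HH triplets."""
    return "".join(f"%{octet:02X}" for octet in octets)


def _convert_run(match) -> str:
    # Step 3: turn the escapes into octets.
    octets = bytes.fromhex(match.group(0).replace("%", ""))
    out = []
    # Octets that are not legal UTF-8 come back as U+DC80..U+DCFF.
    for ch in octets.decode("utf-8", errors="surrogateescape"):
        cp = ord(ch)
        if 0xDC80 <= cp <= 0xDCFF:
            # Step 4: re-escape the octet that failed to decode.
            out.append(_escape(bytes([cp - 0xDC00])))
        elif cp < 0x80 or cp in BIDI_FORMATTING:
            # Step 5 (simplified): keep ASCII and bidi controls escaped.
            out.append(_escape(ch.encode("utf-8")))
        else:
            # Step 6: keep the decoded character.
            out.append(ch)
    return "".join(out)


def unescape_utf8(uri: str) -> str:
    """Convert legal UTF-8 escape sequences in a URI to characters."""
    return ESCAPE_RUN.sub(_convert_run, uri)
```

Re-escaped octets are always written with uppercase hexadecimal digits, so an input containing '%e2%80%ae' comes back as '%E2%80%AE'. This is the loss of escape case that the third example points out.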
| 4. Bidirectional IRIs for Right-to-left Languages | 4. Bidirectional IRIs for Right-to-left Languages | |||
| Some UCS characters, such as those used in the Arabic and Hebrew | Some UCS characters, such as those used in the Arabic and Hebrew | |||
| script, have an inherent right-to-left (rtl) writing direction. IRIs | script, have an inherent right-to-left (rtl) writing direction. IRIs | |||
| containing such characters (called bidirectional IRIs or Bidi IRIs) | containing such characters (called bidirectional IRIs or Bidi IRIs) | |||
| require additional attention because of the non-trivial relation | require additional attention because of the non-trivial relation | |||
| between logical representation (used for digital representation as | between logical representation (used for digital representation as | |||
| skipping to change at page 16, line 4 | skipping to change at page 16, line 22 | |||
| well as when reading/spelling) and visual representation (used for | well as when reading/spelling) and visual representation (used for | |||
| display/printing). | display/printing). | |||
| Because of the complex interaction between the logical | Because of the complex interaction between the logical | |||
| representation, the visual representation, and the syntax of a Bidi | representation, the visual representation, and the syntax of a Bidi | |||
| IRI, a balance is needed between various requirements. The main | IRI, a balance is needed between various requirements. The main | |||
| requirements are: | requirements are: | |||
| 1) user-predictable conversion between visual and logical | 1) user-predictable conversion between visual and logical | |||
| representation; | representation; | |||
| 2) the ability to include a wide range of characters in various | 2) the ability to include a wide range of characters in various | |||
| parts of the IRI; | parts of the IRI; | |||
| 3) no or not too big changes or restrictions for implementations. | 3) minor or no changes or restrictions for implementations. | |||
| 4.1 Logical Storage and Visual Presentation | 4.1 Logical Storage and Visual Presentation | |||
| When stored or transmitted in digital representation, bidirectional | When stored or transmitted in digital representation, bidirectional | |||
| IRIs MUST be in full logical order, and MUST conform to the IRI | IRIs MUST be in full logical order, and MUST conform to the IRI | |||
| syntax rules (which includes the rules relevant to their scheme). | syntax rules (which includes the rules relevant to their scheme). | |||
| This assures that bidirectional IRIs can be processed in the same way | This assures that bidirectional IRIs can be processed in the same way | |||
| as other IRIs. | as other IRIs. | |||
| When rendered, bidirectional IRIs MUST be rendered using the Unicode | When rendered, bidirectional IRIs MUST be rendered using the Unicode | |||
| Bidirectional Algorithm [UNIV4], [UNI9]. Bidirectional IRIs MUST be | Bidirectional Algorithm [UNIV4], [UNI9]. Bidirectional IRIs MUST be | |||
| rendered with an overall left-to-right (ltr) direction. | rendered with an overall left-to-right (ltr) direction. | |||
| In text with a left-to-right base directionality or embedding (as | In text with a left-to-right base directionality or embedding (such | |||
| used for e.g. English or Cyrillic), the Unicode Bidirectional | as used for English or Cyrillic), the Unicode Bidirectional Algorithm | |||
| Algorithm will automatically use an overall ltr direction for the | will automatically use an overall ltr direction for the IRI. In text | |||
| IRI. In text with a rtl base directionality or embedding (as used | with a rtl base directionality or embedding (such as used for Arabic | |||
| e.g. for Arabic or Hebrew), setting a different embedding direction | or Hebrew), setting a different embedding direction for the IRI is | |||
| for the IRI is needed. Setting the embedding direction can be done | needed. Setting the embedding direction can be done in a higher- | |||
| in a higher-order protocol (e.g. the dir='ltr' attribute in HTML). | order protocol (e.g. the dir='ltr' attribute in HTML). If this is | |||
| If this is not available (e.g. in plain text), setting the embedding | not available (e.g. in plain text), setting the embedding is done | |||
| is done with Unicode bidi formatting codes, i.e. U+202A, LEFT-TO- | with Unicode bidi formatting codes, i.e. U+202A, LEFT-TO-RIGHT | |||
| RIGHT EMBEDDING (LRE) before the IRI, and U+202C, POP DIRECTIONAL | EMBEDDING (LRE) before the IRI, and U+202C, POP DIRECTIONAL | |||
| FORMATTING (PDF) after the IRI, both not being part of the IRI | FORMATTING (PDF) after the IRI, both not being part of the IRI | |||
| itself. | itself. | |||
| IRIs MUST NOT contain bidirectional formatting characters (LRM, RLM, | IRIs MUST NOT contain bidirectional formatting characters (LRM, RLM, | |||
| LRE, RLE, LRO, RLO, and PDF). They affect the visual rendering of | LRE, RLE, LRO, RLO, and PDF). They affect the visual rendering of | |||
| the IRI, but do not themselves appear visually. It would therefore | the IRI, but do not themselves appear visually. It would therefore | |||
| not be possible to correctly input an IRI with such characters. | not be possible to correctly input an IRI with such characters. | |||
| 4.2 Bidi IRI Structure | 4.2 Bidi IRI Structure | |||
| The Unicode Bidirectional Algorithm is designed mainly for running | The Unicode Bidirectional Algorithm is designed mainly for running | |||
| text. To make sure that it does not affect the rendering of | text. To make sure that it does not affect the rendering of | |||
| bidirectional IRIs too much, some restrictions on bidirectional IRIs | bidirectional IRIs too much, some restrictions on bidirectional IRIs | |||
| are necessary. These restrictions are given in terms of delimiters | are necessary. These restrictions are given in terms of delimiters | |||
| (structural characters, mostly punctuation such as '@', '.', ':', | (structural characters, mostly punctuation such as '@', '.', ':', | |||
| '/') and components (usually consisting mostly of letters and | '/') and components (usually consisting mostly of letters and | |||
| digits). | digits). | |||
| The following syntax rules from Section 2.2 correspond to components | The following syntax rules from Section 2.2 correspond to components | |||
| for the purpose of Bidi behavior: iuserinfo, isegment, ihostname, | for the purpose of Bidi behavior: iuserinfo, ipath-segment, | |||
| iquery, and ifragment. | ihostname, iquery, and ifragment. | |||
| Specifications that define the syntax of any of the above components | Specifications that define the syntax of any of the above components | |||
| MAY divide them further and define smaller parts to be components | MAY divide them further and define smaller parts to be components | |||
| according to this document. As an example, the restrictions of | according to this document. As an example, the restrictions of | |||
| [RFC3490] on bidirectional domain names correspond to treating each | [RFC3490] on bidirectional domain names correspond to treating each | |||
| label of the domain name as a component. Even where the components | label of the domain name as a component. Even where the components | |||
| are not defined formally, it may be helpful to think about some | are not defined formally, it may be helpful to think about some | |||
| syntax in terms of components and to apply the relevant restrictions. | syntax in terms of components and to apply the relevant restrictions. | |||
| For example, for the usual name/value syntax in query parts, it is | For example, for the usual name/value syntax in query parts, it is | |||
| convenient to treat each name and each value as a component. As | convenient to treat each name and each value as a component. As | |||
| skipping to change at page 17, line 32 | skipping to change at page 17, line 50 | |||
| 2) A component using right-to-left characters SHOULD start and end | 2) A component using right-to-left characters SHOULD start and end | |||
| with right-to-left characters. | with right-to-left characters. | |||
| The above restrictions are given as shoulds, rather than as musts. | The above restrictions are given as shoulds, rather than as musts. | |||
| For IRIs that are never presented visually, they are not relevant. | For IRIs that are never presented visually, they are not relevant. | |||
| However, for IRIs in general, they are very important to ensure | However, for IRIs in general, they are very important to ensure | |||
| consistent conversion between visual presentation and logical | consistent conversion between visual presentation and logical | |||
| representation, in both directions. | representation, in both directions. | |||
| In some components, the above restrictions may actually be strictly | In some components, the above restrictions may actually be | |||
| enforced. For example, [RFC3490] requires that these restrictions | strictly enforced. For example, [RFC3490] requires that these | |||
| apply to the labels of the host name part of an IRI. In some other | restrictions apply to the labels of the host name part of an | |||
| components, for example path components, following these restrictions | IRI. In some other components, for example path components, | |||
| may not be too difficult. For other components, such as parts of the | following these restrictions may not be too difficult. For | |||
| query part, it may be very difficult to enforce the restrictions, | other components, such as parts of the query part, it may be | |||
| because the values of query parameters may be arbitrary character | very difficult to enforce the restrictions, because the values | |||
| sequences. | of query parameters may be arbitrary character sequences. | |||
| If the above restrictions cannot be satisfied otherwise, the affected | If the above restrictions cannot be satisfied otherwise, the affected | |||
| component can always be mapped to URI notation as described in | component can always be mapped to URI notation as described in | |||
| Section 3.1. Please note that the whole component needs to be mapped | Section 3.1. Please note that the whole component needs to be mapped | |||
| (see also Example 9 below). | (see also Example 9 below). | |||
| 4.3 Input of Bidi IRIs | 4.3 Input of Bidi IRIs | |||
| Bidi input methods MUST generate Bidi IRIs in logical order while | Bidi input methods MUST generate Bidi IRIs in logical order while | |||
| rendering them according to Section 4.1. During input, rendering | rendering them according to Section 4.1. During input, rendering | |||
| skipping to change at page 19, line 34 | skipping to change at page 20, line 4 | |||
| component: | component: | |||
| logical representation: http://ab.cd.ef/GH1/2IJ/KL.html | logical representation: http://ab.cd.ef/GH1/2IJ/KL.html | |||
| visual representation: http://ab.cd.ef/LK/JI1/2HG.html | visual representation: http://ab.cd.ef/LK/JI1/2HG.html | |||
| The sequence '1/2' is interpreted by the bidi algorithm as a | The sequence '1/2' is interpreted by the bidi algorithm as a | |||
| fraction, fragmenting the components and leading to confusion. There | fraction, fragmenting the components and leading to confusion. There | |||
| are other characters that are interpreted in a special way close to | are other characters that are interpreted in a special way close to | |||
| numbers, in particular '+', '-', '#', '$', '%', ',', '.', and ':'. | numbers, in particular '+', '-', '#', '$', '%', ',', '.', and ':'. | |||
| Example 9 (not allowed): The numbers in the previous example are | Example 9 (not allowed): The numbers in the previous example are | |||
| escaped: | escaped: | |||
| logical representation: http://ab.cd.ef/GH%31/%32IJ/KL.html, | logical representation: http://ab.cd.ef/GH%31/%32IJ/KL.html, | |||
| visual representation (Hebrew): http://ab.cd.ef/LK/JI%32/%31HG.html | visual representation (Hebrew): http://ab.cd.ef/LK/JI%32/%31HG.html | |||
| visual representation (Arabic): http://ab.cd.ef/LK/JI32%/31%HG.html | visual representation (Arabic): http://ab.cd.ef/LK/JI32%/31%HG.html | |||
| Depending on whether the upper-case letters represent Arabic or | Depending on whether the upper-case letters represent Arabic or | |||
| Hebrew, the visual representation is different. | Hebrew, the visual representation is different. | |||
| Example 10 (allowed, but not recommended): | ||||
| logical representation: http://ab.CDEFGH.123/kl/mn/op.html | ||||
| visual representation: http://ab.123.HGFEDC/kl/mn/op.html | ||||
| Components consisting of only numbers are allowed (it would be rather | ||||
| difficult to prohibit them), but may interact with adjacent RTL | ||||
| components in ways that are not easy to predict. | ||||
| 5. IRI Equivalence and Comparison | 5. IRI Equivalence and Comparison | |||
| This section discusses IRI Equivalence and Comparison similar to | This section discusses IRI Equivalence and Comparison similar to | |||
| Section 6, "Normalization and Comparison", in [RFCYYYY]. This | Section 6, "Normalization and Comparison", in [RFCYYYY]. This | |||
| section focusses on the main issues and on aspects that are different | section focuses on the main issues and on aspects that are different | |||
| from [RFCYYYY]; Section 6 of [RFCYYYY] is recommended background | from [RFCYYYY]; Section 6 of [RFCYYYY] is recommended background | |||
| reading. | reading. | |||
| There is no general rule or procedure to decide whether two arbitrary | There is no general rule or procedure to decide whether two arbitrary | |||
| IRIs are equivalent or not (i.e. whether they refer to the same | IRIs are equivalent or not (i.e. whether they refer to the same | |||
| resource or not). Two IRIs that look almost the same may refer to | resource or not). Two IRIs that look almost the same may refer to | |||
| different resources. Two IRIs that look completely different may | different resources. Two IRIs that look completely different may | |||
| refer to the same resource. Each specification or application that | refer to the same resource. Each specification or application that | |||
| uses IRIs has to decide on the appropriate criterion for IRI | uses IRIs has to decide on the appropriate criterion for IRI | |||
| equivalence. | equivalence. | |||
| skipping to change at page 21, line 10 | skipping to change at page 21, line 36 | |||
| The Unicode Standard [UNIV4] defines various equivalences between | The Unicode Standard [UNIV4] defines various equivalences between | |||
| sequences of characters for various purposes. Unicode Standard Annex | sequences of characters for various purposes. Unicode Standard Annex | |||
| #15 [UTR15] defines various Normalization Forms for these | #15 [UTR15] defines various Normalization Forms for these | |||
| equivalences, in particular Normalization Form C (NFC, Canonical | equivalences, in particular Normalization Form C (NFC, Canonical | |||
| Decomposition, followed by Canonical Composition) and Normalization | Decomposition, followed by Canonical Composition) and Normalization | |||
| Form KC (NFKC, Compatibility Decomposition, followed by Canonical | Form KC (NFKC, Compatibility Decomposition, followed by Canonical | |||
| Composition). | Composition). | |||
| Equivalence of IRIs MUST rely on the assumption that IRIs are | Equivalence of IRIs MUST rely on the assumption that IRIs are | |||
| appropriately pre-normalized, rather than applying normalization when | appropriately pre-normalized, rather than applying normalization when | |||
| comparing two IRIs. The exceptions are convertsion from a non- | comparing two IRIs. The exceptions are conversion from a non-digital | |||
| digital form, and conversion from a non-UCS-based encoding to an UCS- | form, and conversion from a non-UCS-based encoding to an UCS-based | |||
| based encoding. In these cases, NFC or a normalizing transcoder | encoding. In these cases, NFC or a normalizing transcoder using NFC | |||
| using NFC MUST be used for interoperability. To avoid false | MUST be used for interoperability. To avoid false negatives and | |||
| negatives and problems with transcoding, IRIs SHOULD be created using | problems with transcoding, IRIs SHOULD be created using NFC. Using | |||
| NFC. Using NFKC will avoid even more problems. | NFKC may avoid even more problems, for example by choosing half-width | |||
| Latin letters instead of full-width, and full-width Katakana instead | ||||
| of half-width. | ||||
| As an example, http://www.example.org/résumé.html (in XML | As an example, http://www.example.org/résumé.html (in XML | |||
| Notation) is in NFC. On the other hand, http://www.example.org/ | Notation) is in NFC. On the other hand, http://www.example.org/ | |||
| résumé.html is not in NFC. The former uses precombined | résumé.html is not in NFC. The former uses precombined | |||
| e-acute characters, the latter uses 'e' characters followed by | e-acute characters, the latter uses 'e' characters followed by | |||
| combining acute accents. Both usages are defined to be canonically | combining acute accents. Both usages are defined to be canonically | |||
| equivalent in [UNIV4]. | equivalent in [UNIV4]. | |||
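The difference between the two spellings can be checked with Python's standard `unicodedata` module. This is a minimal sketch: the helper `is_nfc` and the variable names are hypothetical, and the two strings are the example IRIs above written with explicit code points (U+00E9 for the precombined e-acute, 'e' followed by U+0301 for the combining form).

```python
import unicodedata

# Precombined e-acute (U+00E9): already in NFC.
precomposed = "http://www.example.org/r\u00e9sum\u00e9.html"

# 'e' followed by COMBINING ACUTE ACCENT (U+0301): not in NFC.
decomposed = "http://www.example.org/re\u0301sume\u0301.html"


def is_nfc(text: str) -> bool:
    """Return True if normalizing to NFC leaves the text unchanged."""
    return unicodedata.normalize("NFC", text) == text


# The two IRIs differ as character sequences, so a plain string
# comparison treats them as different even though they are
# canonically equivalent.
same_as_strings = precomposed == decomposed

# Normalizing the decomposed form to NFC yields the precomposed form.
normalized = unicodedata.normalize("NFC", decomposed)

# NFKC additionally folds compatibility variants, for example a
# full-width Latin letter or a half-width Katakana letter.
folded_latin = unicodedata.normalize("NFKC", "\uff21")     # FULLWIDTH A
folded_katakana = unicodedata.normalize("NFKC", "\uff76")  # HALFWIDTH KA
```

The sketch only shows why creating IRIs in NFC avoids false negatives; it is not meant to suggest normalizing at comparison time, which the text above rules out.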
| Because we do not know how a particular field is treated with respect | Because it is unknown how a particular field is being treated | |||
| to text normalization, it would be inappropriate to allow third | with respect to text normalization, it would be inappropriate | |||
| parties to normalize an IRI arbitrarily. This does not contradict | to allow third parties to normalize an IRI arbitrarily. This | |||
| the recommendation that if you create a resource, and an IRI for that | does not contradict the recommendation that when a resource is | |||
| resource, you try to be as normalized as possible (i.e. NFKC if | created, and an IRI for that resource, you try to be as | |||
| possible). This is similar to the upper-case/lower-case problems in | normalized as possible (i.e. NFC or even NFKC). This is | |||
| URIs. Some parts of an URI are case-insensitive (domain name). For | similar to the upper-case/lower-case problems in URIs. Some | |||
| others, it is unclear whether they are case-sensitive or case- | parts of a URI are case-insensitive (domain name). For others, | |||
| insensitive, or something in between (e.g. case-sensitive, but if | it is unclear whether they are case-sensitive or case- | |||
| you use the wrong case, may not directly get a result, but rather a | insensitive, or something in between (e.g. case-sensitive, but | |||
| 'Multiple choices'). The best recipe we have there is that the | if the wrong case is used, a multiple choice selection is | |||
| generator uses a reasonable capitalization, and when transferring the | provided instead of a direct negative result). The best recipe | |||
| URI, you do not change capitalization. | is that the generator uses a reasonable capitalization, and | |||
| when transferring the URI, that capitalization is never changed. | ||||
| Various IRI schemes may allow the usage of International Domain Names | Various IRI schemes may allow the usage of International Domain Names | |||
| (IDN) [RFC3490]. When in use in IRIs, those names SHOULD be | (IDN) [RFC3490]. When in use in IRIs, those names SHOULD be | |||
| validated using the ToASCII operation defined in [RFC3490], with the | validated using the ToASCII operation defined in [RFC3490], with the | |||
| flags "UseSTD3ASCIIRules" and "AllowUnassigned". An IRI containing | flags "UseSTD3ASCIIRules" and "AllowUnassigned". An IRI containing | |||
| an invalid IDN cannot successfully be resolved. For legibility | an invalid IDN cannot successfully be resolved. For legibility | |||
| purposes, IDN components of IRIs SHOULD not be converted into ASCII | purposes, IDN components of IRIs SHOULD NOT be converted into ASCII | |||
| Compatible Encoding (ACE). However, this conversion is applied when | Compatible Encoding (ACE). However, this conversion is applied when | |||
| mapping an IRI into an URI, see Section 3.1. | mapping an IRI into a URI, see Section 3.1. | |||
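The split described above (keep the readable IDN in the IRI, apply ACE only when mapping to a URI) can be illustrated with a minimal Python sketch. The helper name `host_to_ace` is hypothetical. Python's built-in `idna` codec follows RFC 3490 but does not expose the "UseSTD3ASCIIRules" and "AllowUnassigned" flags, so this is not a faithful implementation of the validation step named in the text.

```python
def host_to_ace(ihost: str) -> str:
    """Convert an internationalized host name to its ASCII Compatible Encoding.

    Uses Python's built-in "idna" codec (RFC 3490 ToASCII, applied per label).
    Raises UnicodeError if a label cannot be converted, which a caller could
    treat as "this IRI cannot be resolved".
    """
    return ihost.encode("idna").decode("ascii")


# The IRI keeps the legible form; the ACE form is produced only for the URI.
iri_host = "b\u00fccher.example"
uri_host = host_to_ace(iri_host)
```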
| 5.4 Preferred Forms | 5.4 Preferred Forms | |||
| The following are the preferred forms for IRIs when generated: | The following are the preferred forms for IRIs when generated: | |||
| - Always provide the URI scheme in lowercase characters. | - Always provide the URI scheme in lowercase characters. | |||
| - Only perform percent-escaping where it is essential. | - Only perform percent-escaping where it is essential. | |||
| - Always use uppercase A-through-F characters when percent- | - Always use uppercase A-through-F characters when percent- | |||
| escaping. | escaping. | |||
| - Always provide the hostname, if any, in the form produced when | - Always provide the hostname, if any, in the form produced when | |||
| applying [RFC3491]. This in particular includes using | applying nameprep [RFC3491]. This in particular includes using | |||
| lowercase characters rather than uppercase characters where | lowercase characters rather than uppercase characters where | |||
| applicable. | applicable. | |||
| - Where possible, provide IRI components in NFKC or NFC. | - Where possible, provide IRI components in NFKC or NFC. | |||
| - Prevent /./ and /../ from appearing in non-relative URI paths. | - Prevent /./ and /../ from appearing in non-relative URI paths. | |||
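As a rough illustration of three of the preferred forms listed above (lowercase scheme, uppercase hex digits in percent-escapes, NFC), here is a minimal Python sketch. The function name `prefer` is an assumption made for this example. Hostname processing with nameprep and removal of /./ and /../ segments are deliberately left out, and the input is assumed to be a syntactically valid IRI already.

```python
import re
import unicodedata

_PCT = re.compile(r"%[0-9A-Fa-f]{2}")
_SCHEME = re.compile(r"[A-Za-z][A-Za-z0-9+.-]*")


def prefer(iri: str) -> str:
    """Apply a subset of the preferred forms to an already valid IRI."""
    # Provide IRI components in NFC (NFKC would be the stricter choice).
    iri = unicodedata.normalize("NFC", iri)
    # Use uppercase A-through-F in existing percent-escapes.
    iri = _PCT.sub(lambda m: m.group(0).upper(), iri)
    # Provide the scheme in lowercase, if the text before ":" is a scheme.
    scheme, sep, rest = iri.partition(":")
    if sep and _SCHEME.fullmatch(scheme):
        iri = scheme.lower() + sep + rest
    return iri
```

The sketch does not add or remove any percent-escapes; it only changes the case of the hex digits in escapes that are already present.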
| 6. Use of IRIs | 6. Use of IRIs | |||
| 6.1 Limitations on UCS Characters Allowed in IRIs | 6.1 Limitations on UCS Characters Allowed in IRIs | |||
| skipping to change at page 22, line 46 | skipping to change at page 23, line 30 | |||
| strong visual look-alikes. Because of the likelihood of | strong visual look-alikes. Because of the likelihood of | |||
| transcription errors, these also should be avoided. This | transcription errors, these also should be avoided. This | |||
| includes the full-width equivalents of ASCII characters, half- | includes the full-width equivalents of ASCII characters, half- | |||
| width Katakana characters for Japanese, and many others. This | width Katakana characters for Japanese, and many others. This | |||
| also includes many look-alikes of "space", "delims", and | also includes many look-alikes of "space", "delims", and | |||
| "unwise", characters excluded in [RFC3491]. | "unwise", characters excluded in [RFC3491]. | |||
| Additional information is available from [UNIXML]. [UNIXML] is | Additional information is available from [UNIXML]. [UNIXML] is | |||
| written in the context of running text rather than in the context of | written in the context of running text rather than in the context of | |||
| identifiers. Nevertheless, it discusses many of the categories of | identifiers. Nevertheless, it discusses many of the categories of | |||
| characters and code points not appropriate for IRIs. | characters not appropriate for IRIs. | |||
| 6.2 Software Interfaces and Protocols | 6.2 Software Interfaces and Protocols | |||
| Although an IRI is defined as a sequence of characters, software | Although an IRI is defined as a sequence of characters, software | |||
| interfaces for URIs typically function on sequences of octets or | interfaces for URIs typically function on sequences of octets or | |||
| other kinds of code units. Thus, software interfaces and protocols | other kinds of code units. Thus, software interfaces and protocols | |||
| MUST define which character encoding is used. | MUST define which character encoding is used. | |||
| Intermediate software interfaces between IRI-capable components and | Intermediate software interfaces between IRI-capable components and | |||
| URI-only components MUST map the IRIs per Section 3.1, when | URI-only components MUST map the IRIs per Section 3.1, when | |||
| transferring from IRI-capable to URI-only components. Such a mapping | transferring from IRI-capable to URI-only components. Such a mapping | |||
| SHOULD be applied as late as possible. It should not be applied | SHOULD be applied as late as possible. It SHOULD NOT be applied | |||
| between components that are known to be able to handle IRIs. | between components that are known to be able to handle IRIs. | |||
| 6.3 Format of URIs and IRIs in Documents and Protocols | 6.3 Format of URIs and IRIs in Documents and Protocols | |||
| Document formats that transport URIs may need to be upgraded to allow | Document formats that transport URIs may need to be upgraded to allow | |||
| the transport of IRIs. In those cases where the document as a whole | the transport of IRIs. In those cases where the document as a whole | |||
| has a native character encoding, IRIs MUST also be encoded in this | has a native character encoding, IRIs MUST also be encoded in this | |||
| encoding, and converted accordingly by a parser or interpreter. IRI | encoding, and converted accordingly by a parser or interpreter. IRI | |||
| characters that are not expressible in the native encoding SHOULD be | characters that are not expressible in the native encoding SHOULD be | |||
| escaped using the escaping conventions of the document format if such | escaped using the escaping conventions of the document format if such | |||
| skipping to change at page 23, line 40 | skipping to change at page 24, line 23 | |||
| IRIs to URIs as error-avoiding behavior. XML 1.0 [XML1], XLink | IRIs to URIs as error-avoiding behavior. XML 1.0 [XML1], XLink | |||
| [XLink], and XML Schema [XMLSchema] and specifications based upon | [XLink], and XML Schema [XMLSchema] and specifications based upon | |||
| them allow IRIs. Also, it is expected that all relevant new W3C | them allow IRIs. Also, it is expected that all relevant new W3C | |||
| formats and protocols will be required to handle IRIs [CharMod]. | formats and protocols will be required to handle IRIs [CharMod]. | |||
| 6.4 Use of UTF-8 for Encoding Original Characters | 6.4 Use of UTF-8 for Encoding Original Characters | |||
| This section discusses details and gives examples for point c) in | This section discusses details and gives examples for point c) in | |||
| Section 1.2. In order to be able to use IRIs, the URI corresponding | Section 1.2. In order to be able to use IRIs, the URI corresponding | |||
| to the IRI in question has to encode original characters into octets | to the IRI in question has to encode original characters into octets | |||
| using UTF-8. This can be specified for all URIs of an URI scheme, or | using UTF-8. This can be specified for all URIs of a URI scheme, or | |||
| can apply to individual URIs for schemes that do not specify how to | can apply to individual URIs for schemes that do not specify how to | |||
| encode original characters. It can apply to the whole URI, or only | encode original characters. It can apply to the whole URI, or only | |||
| some part. | some part. | |||
| For new URI schemes, using UTF-8 is recommended in [RFC2718]. | For new URI schemes, using UTF-8 is recommended in [RFC2718]. | |||
| Examples where this is already used are the URN syntax [RFC2141], | Examples where this is already used are the URN syntax [RFC2141], | |||
| IMAP URLs [RFC2192], and POP URLs [RFC2384]. On the other hand, the | IMAP URLs [RFC2192], and POP URLs [RFC2384]. On the other hand, | |||
| HTTP URL scheme does not specify how to encode original characters, | because the HTTP URL scheme does not specify how to encode original | |||
| and therefore IRIs only can be used for some HTTP URLs. | characters, only some HTTP URLs can have corresponding but different | |||
| IRIs. | ||||
| For example, for a document with a URI of | For example, for a document with a URI of | |||
| http://www.example.org/r%C3%A9sum%C3%A9.html, it is possible to | http://www.example.org/r%C3%A9sum%C3%A9.html, it is possible to | |||
| construct a corresponding IRI (in XML notation, see Section 1.4): | construct a corresponding IRI (in XML notation, see Section 1.4): | |||
| http://www.example.org/r&#xE9;sum&#xE9;.html (&#xE9; stands for the | http://www.example.org/r&#xE9;sum&#xE9;.html (&#xE9; stands for the | |||
| e-acute character, and %C3%A9 is the UTF-8 encoded and escaped | e-acute character, and %C3%A9 is the UTF-8 encoded and escaped | |||
| representation of that character). On the other hand, for a document | representation of that character). On the other hand, for a document | |||
| with an URI of http://www.example.org/r%E9sum%E9.html, the escaped | with a URI of http://www.example.org/r%E9sum%E9.html, the escaped | |||
| octets cannot be converted to actual characters in an IRI, because | octets cannot be converted to actual characters in an IRI, because | |||
| the escaping is not based on UTF-8. | the escaping is not based on UTF-8. | |||
| The requirement for the use of UTF-8 applies to all parts of an URI, | The requirement for the use of UTF-8 applies to all parts of a URI, | |||
| with the exception of the ihostname part. However, it is possible | with the exception of the ihostname part. However, it is possible | |||
| that the capability of IRIs to represent a wide range of characters | that the capability of IRIs to represent a wide range of characters | |||
| directly is used just in some parts of the IRI (or IRI reference). | directly is used just in some parts of the IRI (or IRI reference). | |||
| The other parts of the IRI may only contain ASCII characters, or they | The other parts of the IRI may only contain ASCII characters, or they | |||
| may not be based on UTF-8. They may be based on another encoding, or | may not be based on UTF-8. They may be based on another encoding, or | |||
| they may directly encode raw binary data (see also [RFC2397]). | they may directly encode raw binary data (see also [RFC2397]). | |||
| For example, it is possible to have an URI reference of | For example, it is possible to have a URI reference of | |||
| http://www.example.org/r%E9sum%E9.xml#r%C3%A9sum%C3%A9, where the | http://www.example.org/r%E9sum%E9.xml#r%C3%A9sum%C3%A9, where the | |||
| document name is encoded in iso-8859-1 based on server settings, but | document name is encoded in iso-8859-1 based on server settings, but | |||
| the fragment identifier is encoded in UTF-8 according to [XPointer]. | the fragment identifier is encoded in UTF-8 according to [XPointer]. | |||
| The IRI corresponding to the above URI would be (in XML notation) | The IRI corresponding to the above URI would be (in XML notation) | |||
| http://www.example.org/r%E9sum%E9.xml#r&#xE9;sum&#xE9;. | http://www.example.org/r%E9sum%E9.xml#r&#xE9;sum&#xE9;. | |||
| @@@@ add something about query parts | Similar considerations apply to query parts. The functionality of | |||
| IRIs (namely to be able to include non-ASCII characters) can only be | ||||
| used if the query part is encoded in UTF-8. | ||||
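The examples in this section can be reproduced with a small Python sketch that turns percent-escapes into characters only where they form valid UTF-8 for a non-ASCII character, and leaves every other escape untouched. The name `uri_to_iri` is hypothetical, and the sketch ignores further restrictions on which characters an IRI may contain; it is an illustration under those assumptions, not the conversion procedure of the draft.

```python
import re

_RUN = re.compile(r"(?:%[0-9A-Fa-f]{2})+")


def _decode_run(match):
    raw = bytes.fromhex(match.group(0).replace("%", ""))
    out, i = [], 0
    while i < len(raw):
        b = raw[i]
        # Length of the UTF-8 sequence announced by the lead byte (0 = none).
        n = (2 if 0xC2 <= b <= 0xDF else
             3 if 0xE0 <= b <= 0xEF else
             4 if 0xF0 <= b <= 0xF4 else 0)
        chunk = raw[i:i + n]
        try:
            if n == 0 or len(chunk) < n:
                raise ValueError("not a multi-byte UTF-8 sequence")
            # Strict decoding rejects overlong forms and surrogates.
            out.append(chunk.decode("utf-8"))
            i += n
        except ValueError:
            # ASCII or non-UTF-8 octet: keep it percent-escaped.
            out.append("%%%02X" % b)
            i += 1
    return "".join(out)


def uri_to_iri(uri: str) -> str:
    """Replace UTF-8 based percent-escapes by the characters they encode."""
    return _RUN.sub(_decode_run, uri)
```

With this sketch, the escapes in r%C3%A9sum%C3%A9.html become e-acute characters, while r%E9sum%E9.html comes back unchanged because %E9 alone is not valid UTF-8. Escapes that are kept are re-emitted with uppercase hex digits.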
| 6.5 Relative IRI References | 6.5 Relative IRI References | |||
| Processing of relative forms of IRIs against a base is handled | Processing of relative forms of IRIs against a base is handled | |||
| straightforwardly; the algorithms of [RFCYYYY] can be applied | straightforwardly; the algorithms of [RFCYYYY] can be applied | |||
| directly, treating the characters additionally allowed in IRIs in the | directly, treating the characters additionally allowed in IRIs in the | |||
| same way as unreserved characters in URIs. | same way as unreserved characters in URIs. | |||
| 7. URI/IRI Processing Guidelines (informative) | 7. URI/IRI Processing Guidelines (informative) | |||
| skipping to change at page 25, line 41 | skipping to change at page 26, line 27 | |||
| The process of IRI entry must assure, as far as possible, that the | The process of IRI entry must assure, as far as possible, that the | |||
| restrictions defined in Section 2.2 are met. This may be done by | restrictions defined in Section 2.2 are met. This may be done by | |||
| choosing appropriate input methods or variants/settings thereof, by | choosing appropriate input methods or variants/settings thereof, by | |||
| appropriately converting the characters being input, by eliminating | appropriately converting the characters being input, by eliminating | |||
| characters that cannot be converted, and/or by issuing a warning or | characters that cannot be converted, and/or by issuing a warning or | |||
| error message to the user. | error message to the user. | |||
| As an example of variant settings, input method editors for East | As an example of variant settings, input method editors for East | |||
| Asian Languages usually allow the input of Latin letters and related | Asian Languages usually allow the input of Latin letters and related | |||
| characters in full-width or half-width versions. For IRI input, the | characters in full-width or half-width versions. For IRI input, the | |||
| input method editor should be set to half-width input, in order to | input method editor should be set so that it produces half-width | |||
| produce US-ASCII characters where possible. | Latin letters, and full-width Katakana. | |||
| An input field primarily or only used for the input of URIs/IRIs | An input field primarily or only used for the input of URIs/IRIs may | |||
| should allow the user to view an IRI as mapped to a URI. Places | allow the user to view an IRI as mapped to a URI. Places where the | |||
| where the input of IRIs is frequent should provide the possibility | input of IRIs is frequent may provide the possibility for viewing an | |||
| for viewing an IRI as mapped to a URI. This will help users when | IRI as mapped to a URI. This will help users when some of the | |||
| some of the software they use does not yet accept IRIs. | software they use does not yet accept IRIs. | |||
| An IRI input component that interfaces to components that handle | An IRI input component that interfaces to components that handle | |||
| URIs, but not IRIs, must map the IRI to a URI before passing it to | URIs, but not IRIs, must map the IRI to a URI before passing it to | |||
| such a component. | such a component. | |||
| For the input of IRIs with right-to-left characters, please see | For the input of IRIs with right-to-left characters, please see | |||
| Section 4.3. | Section 4.3. | |||
| 7.3 URI/IRI Transfer Between Applications | 7.3 URI/IRI Transfer Between Applications | |||
| skipping to change at page 26, line 23 | skipping to change at page 27, line 8 | |||
| based on URI syntax. They then allow the user to click on such URIs | based on URI syntax. They then allow the user to click on such URIs | |||
| and retrieve the corresponding resource in an appropriate (usually | and retrieve the corresponding resource in an appropriate (usually | |||
| scheme-dependent) application. | scheme-dependent) application. | |||
| Such applications have to be upgraded to use the IRI syntax rather | Such applications have to be upgraded to use the IRI syntax rather | |||
| than the URI syntax as a base for heuristics. In particular, a non- | than the URI syntax as a base for heuristics. In particular, a non- | |||
| ASCII character should not be taken as the indication of the end of | ASCII character should not be taken as the indication of the end of | |||
| an IRI. Such applications also have to make sure that they correctly | an IRI. Such applications also have to make sure that they correctly | |||
| convert the detected IRI from the encoding of the document or | convert the detected IRI from the encoding of the document or | |||
| application where the IRI appears to the encoding used by the system- | application where the IRI appears to the encoding used by the system- | |||
| wide IRI invocation mechanism, or to an URI (according to Section | wide IRI invocation mechanism, or to a URI (according to Section 3.1) | |||
| 3.1) if the system-wide invocation mechanism only accepts URIs. | if the system-wide invocation mechanism only accepts URIs. | |||
| The clipboard is another frequently used way to transfer URIs and | The clipboard is another frequently used way to transfer URIs and | |||
| IRIs from one application to another. On most platforms, the | IRIs from one application to another. On most platforms, the | |||
| clipboard is able to store and transfer text in many languages and | clipboard is able to store and transfer text in many languages and | |||
| scripts. Correctly used, the clipboard transfers characters, not | scripts. Correctly used, the clipboard transfers characters, not | |||
| bytes, which will do the right thing with IRIs. | bytes, which will do the right thing with IRIs. | |||
| 7.4 URI/IRI Generation | 7.4 URI/IRI Generation | |||
| Systems that offer resources through the Internet, where those | Systems that offer resources through the Internet, where those | |||
| skipping to change at page 26, line 47 | skipping to change at page 27, line 32 | |||
| generate a directory listing for a file directory, and then respond | generate a directory listing for a file directory, and then respond | |||
| to the generated URIs with the files. | to the generated URIs with the files. | |||
| Many legacy character encodings are in use in various file systems. | Many legacy character encodings are in use in various file systems. | |||
| Many currently deployed systems do not transform the local character | Many currently deployed systems do not transform the local character | |||
| representation of the underlying system before generating URIs. | representation of the underlying system before generating URIs. | |||
| For maximum interoperability, systems that generate resource | For maximum interoperability, systems that generate resource | |||
| identifiers should do the appropriate transformations. For example, | identifiers should do the appropriate transformations. For example, | |||
| if a file system contains a file named résumé.html, a | if a file system contains a file named résumé.html, a | |||
| server should expose this as r%C3%A9sum%C3%A9.html in an URI, which | server should expose this as r%C3%A9sum%C3%A9.html in a URI, which | |||
| allows to use résumé.html in an IRI, even if the file name | allows to use résumé.html in an IRI, even if the file name | |||
| locally is kept in an encoding other than UTF-8. | locally is kept in an encoding other than UTF-8. | |||
| This recommendation in particular applies to HTTP servers. For FTP | This recommendation in particular applies to HTTP servers. For FTP | |||
| servers, similar considerations apply, see in particular [RFC2640]. | servers, similar considerations apply, see in particular [RFC2640]. | |||
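The transformation recommended in this section can be sketched in Python as follows. The function name `path_segment_for` and the choice of iso-8859-1 as the local file system encoding are assumptions for illustration; `urllib.parse.quote` encodes strings as UTF-8 by default before percent-escaping.

```python
from urllib.parse import quote


def path_segment_for(name_bytes: bytes, local_encoding: str) -> str:
    """Build a URI path segment from a file name stored in a legacy encoding."""
    # First recover the characters from the local representation ...
    name = name_bytes.decode(local_encoding)
    # ... then escape them based on UTF-8, not on the local encoding.
    return quote(name, safe="")


# A file name kept locally in iso-8859-1 (hypothetical server setting).
segment = path_segment_for(b"r\xe9sum\xe9.html", "iso-8859-1")
```

Escaping the raw local bytes directly would instead give r%E9sum%E9.html, which has no corresponding IRI with actual characters.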
| 7.5 URI/IRI Selection | 7.5 URI/IRI Selection | |||
| In some cases, resource owners and publishers have control over the | In some cases, resource owners and publishers have control over the | |||
| IRIs used to identify their resources. Such control is mostly | IRIs used to identify their resources. Such control is mostly | |||
| skipping to change at page 27, line 31 | skipping to change at page 28, line 17 | |||
| here. As long as names are limited to characters from a single | here. As long as names are limited to characters from a single | |||
| script, native writers of a given script or language will know best | script, native writers of a given script or language will know best | |||
| when ambiguities can appear, and how they can be avoided. What may | when ambiguities can appear, and how they can be avoided. What may | |||
| look ambiguous to a stranger may be completely obvious to the average | look ambiguous to a stranger may be completely obvious to the average | |||
| native user. On the other hand, in some cases, the UCS contains | native user. On the other hand, in some cases, the UCS contains | |||
| variants for compatibility reasons, for example for typographic | variants for compatibility reasons, for example for typographic | |||
| purposes. These should be avoided wherever possible. Although there | purposes. These should be avoided wherever possible. Although there | |||
| may be exceptions, in general newly created resource names should be | may be exceptions, in general newly created resource names should be | |||
| in NFKC [UTR15] (which means that they are also in NFC). | in NFKC [UTR15] (which means that they are also in NFC). | |||
| As an example, the UCS contains codepoint U+FB01 for the 'fi' | As an example, the UCS contains the 'fi' ligature at U+FB01 for | |||
| ligature for compatibility reasons. Wherever possible, IRIs should | compatibility reasons. Wherever possible, IRIs should use the two | |||
| use the two letters 'f' and 'i' rather than the 'fi' ligature. An | letters 'f' and 'i' rather than the 'fi' ligature. An example where | |||
| example where the later may be used is in the query part of an IRI | the latter may be used is in the query part of an IRI for an explicit | |||
| for an explicit search for a word containing the 'fi' ligature. | search for a word written containing the 'fi' ligature. | |||
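A resource owner could check a proposed name against this recommendation with a few lines of Python. This is a minimal sketch; the helper name `is_nfkc` is hypothetical, and the result depends on the Unicode version shipped with the Python interpreter.

```python
import unicodedata


def is_nfkc(name: str) -> bool:
    """Return True if the name is unchanged by NFKC normalization."""
    return unicodedata.normalize("NFKC", name) == name


# U+FB01 is the 'fi' ligature; NFKC replaces it by the letters 'f' and 'i'.
with_ligature = "\ufb01le.html"
normalized = unicodedata.normalize("NFKC", with_ligature)
```

A name that fails the check is not necessarily wrong (the query example above is one legitimate exception), so a tool would more reasonably warn than reject.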
| In certain cases, there is a chance that characters from different | In certain cases, there is a chance that characters from different | |||
| scripts look the same. The best known example is the Latin 'A', the | scripts look the same. The best known example is the Latin 'A', the | |||
| Greek 'Alpha', and the Cyrillic 'A'. To avoid such cases, only IRIs | Greek 'Alpha', and the Cyrillic 'A'. To avoid such cases, only IRIs | |||
| should be generated where all the characters in a single component | should be generated where all the characters in a single component | |||
| are used together in a given language. This usually means that all | are used together in a given language. This usually means that all | |||
| these characters will be from the same script, but there are | these characters will be from the same script, but there are | |||
| languages that mix characters from different scripts (such as | languages that mix characters from different scripts (such as | |||
| Japanese). This is similar to the heuristics used to distinguish | Japanese). This is similar to the heuristics used to distinguish | |||
| between letters and numbers in the examples above. Also, for Latin, | between letters and numbers in the examples above. Also, for Latin, | |||
| skipping to change at page 28, line 21 | skipping to change at page 29, line 6 | |||
| Software that interprets IRIs as the names of local resources should | Software that interprets IRIs as the names of local resources should | |||
| accept IRIs in multiple forms, and convert and match them with the | accept IRIs in multiple forms, and convert and match them with the | |||
| appropriate local resource names. | appropriate local resource names. | |||
| First, multiple representations include both IRIs in the native | First, multiple representations include both IRIs in the native | |||
| character encoding of the protocol and also their URI counterparts. | character encoding of the protocol and also their URI counterparts. | |||
| Second, it may include URIs constructed based on other character | Second, it may include URIs constructed based on other character | |||
| encodings than UTF-8. Such URIs may be produced by user agents that | encodings than UTF-8. Such URIs may be produced by user agents that | |||
| do not conform to this specification and use legacy encodings to | do not conform to this specification and use legacy encodings to | |||
| convert non-ASCII characters to URIs. Whether this is necessary, and | convert non-ASCII characters to URIs. Whether this is necessary and | |||
| what character encodings to cover, depends on a number of factors, | what character encodings to cover, depends on a number of factors, | |||
| such as the legacy character encodings used locally and the | such as the legacy character encodings used locally and the | |||
| distribution of various versions of user agents. For example, | distribution of various versions of user agents. For example, | |||
| software for Japanese may accept URIs in Shift_JIS and/or EUC-JP in | software for Japanese may accept URIs in Shift_JIS and/or EUC-JP in | |||
| addition to UTF-8. | addition to UTF-8. | |||
| Third, it may include additional mappings to be more user-friendly | Third, it may include additional mappings to be more user-friendly | |||
| and robust against transmission errors. These would be similar to | and robust against transmission errors. These would be similar to | |||
| how currently some servers treat URIs as case-insensitive, or perform | how currently some servers treat URIs as case-insensitive, or perform | |||
| additional matching to account for spelling errors. For characters | additional matching to account for spelling errors. For characters | |||
| skipping to change at page 29, line 47 | skipping to change at page 30, line 32 | |||
| pass some security tests and then be interpreted as '/..' in a path | pass some security tests and then be interpreted as '/..' in a path | |||
| if UTF-8 decoders are fault-tolerant, if conversion and checking are | if UTF-8 decoders are fault-tolerant, if conversion and checking are | |||
| not done in the right order, and/or if reserved characters and | not done in the right order, and/or if reserved characters and | |||
| unreserved characters are not clearly distinguished. | unreserved characters are not clearly distinguished. | |||
| There are various ways in which "spoofing" can occur with IRIs. | There are various ways in which "spoofing" can occur with IRIs. | |||
| "Spoofing" means that somebody may add a resource name that looks the | "Spoofing" means that somebody may add a resource name that looks the | |||
| same or similar to the user, but points to a different resource. The | same or similar to the user, but points to a different resource. The | |||
| added resource may pretend to be the real resource by looking very | added resource may pretend to be the real resource by looking very | |||
| similar, but may contain all kinds of changes that may be difficult | similar, but may contain all kinds of changes that may be difficult | |||
| to spot but can cause all kinds of problems. Most spoofing | to spot and can cause all kinds of problems. Most spoofing | |||
| possibilities for IRIs are extensions of those for URIs. | possibilities for IRIs are extensions of those for URIs. | |||
| Spoofing can occur for various reasons. A first reason is that | Spoofing can occur for various reasons. A first reason is that | |||
| normalization expectations of a user or actual normalization when | normalization expectations of a user or actual normalization when | |||
| entering an IRI, or when transcoding an IRI from a legacy encoding, | entering an IRI, or when transcoding an IRI from a legacy encoding, | |||
| do not match the normalization used on the server side. | do not match the normalization used on the server side. | |||
| Conceptually, this is no different from the problems surrounding the | Conceptually, this is no different from the problems surrounding the | |||
| use of case-insensitive web servers. For example, a popular web page | use of case-insensitive web servers. For example, a popular web page | |||
| with a mixed case name (http://big.site/PopularPage.html) might be | with a mixed case name (http://big.site/PopularPage.html) might be | |||
| "spoofed" by someone who is able to create http://big.site/ | "spoofed" by someone who is able to create http://big.site/ | |||
| popularpage.html. However, the introduction of character | popularpage.html. However, the introduction of character | |||
| normalization, and of additional mappings for user convenience, may | normalization, and of additional mappings for user convenience, may | |||
| increase the chance for spoofing. Protocols and servers that allow | increase the chance for spoofing. Protocols and servers that allow | |||
| the creation of resources with unnormalized names, and resources with | the creation of resources with unnormalized names, and resources with | |||
| names that are not normalized, are particularly vulnerable to such | names that are not normalized, are particularly vulnerable to such | |||
| attacks. This is an inherent security problem of the relevant | attacks. This is an inherent security problem of the relevant | |||
| protocol, server, or resource, and not specific to IRIs, but | protocol, server, or resource, and not specific to IRIs, but | |||
| mentioned here for completeness. | mentioned here for completeness. | |||
| Spoofing can occur in various IRI components, such as the domain name | ||||
| part or a path part. For considerations specific to the domain name | ||||
| part, see [RFC3491]. For the path part, administrators of sites | ||||
| which allow independent users to create resources in the same subarea | ||||
| may need to be careful to check for spoofing. | ||||
| Spoofing can occur because in the UCS, there are many characters that | Spoofing can occur because in the UCS, there are many characters that | |||
| look very similar. Details are discussed in Section 7.5. Again, | look very similar. Details are discussed in Section 7.5. Again, | |||
| this is very similar to spoofing possibilities on US-ASCII, e.g. | this is very similar to spoofing possibilities on US-ASCII, e.g. | |||
| using 'br0ken' or '1ame' URIs. | using 'br0ken' or '1ame' URIs. | |||
| Spoofing can occur when URIs in various encodings are accepted to | Spoofing can occur when URIs in various encodings are accepted to | |||
| deal with older user agents. In some cases, in particular for Latin- | deal with older user agents. In some cases, in particular for Latin- | |||
| based resource names, this is usually easy to detect because UTF-8- | based resource names, this is usually easy to detect because UTF-8- | |||
| encoded names, when interpreted and viewed as legacy encodings, | encoded names, when interpreted and viewed as legacy encodings, | |||
| produce mostly garbage. In other cases, when concurrently used | produce mostly garbage. In other cases, when concurrently used | |||
| encodings have a similar structure, but there are no characters that | encodings have a similar structure, but there are no characters that | |||
| have exactly the same encoding, detection is more difficult. | have exactly the same encoding, detection is more difficult. | |||
| Spoofing can occur in various IRI components, such as the domain name | ||||
| part or a path part. For considerations specific to the domain name | ||||
| part, see [RFC3491]. For the path part, administrators of sites | ||||
| which allow independent users to create resources in the same subarea | ||||
| may need to be careful to check for spoofing. | ||||
| Spoofing can occur with bidirectional IRIs, if the restrictions in | Spoofing can occur with bidirectional IRIs, if the restrictions in | |||
| Section 4.2 are not followed. The same visual representation may be | Section 4.2 are not followed. The same visual representation may be | |||
| interpreted as different logical representations, and vice versa. It | interpreted as different logical representations, and vice versa. It | |||
| is also very important that a correct Unicode bidirectional | is also very important that a correct Unicode bidirectional | |||
| implementation is used. | implementation is used. | |||
| 9. Acknowledgements | 9. Acknowledgements | |||
| We would like to thank Larry Masinter for his work as coauthor of | We would like to thank Larry Masinter for his work as coauthor of | |||
| many earlier versions of this document (draft-masinter-url-i18n-xx). | many earlier versions of this document (draft-masinter-url-i18n-xx). | |||
| skipping to change at page 31, line 12 | skipping to change at page 31, line 46 | |||
| ago. There was a thread in the HTML working group in August 1995 | ago. There was a thread in the HTML working group in August 1995 | |||
| (under the topic of "Globalizing URIs") and in the www-international | (under the topic of "Globalizing URIs") and in the www-international | |||
| mailing list in July 1996 (under the topic of "Internationalization | mailing list in July 1996 (under the topic of "Internationalization | |||
| and URLs"), and ad-hoc meetings at the Unicode conferences in | and URLs"), and ad-hoc meetings at the Unicode conferences in | |||
| September 1995 and September 1997. | September 1995 and September 1997. | |||
| Thanks to Francois Yergeau, Matti Allouche, Roy Fielding, Tim | Thanks to Francois Yergeau, Matti Allouche, Roy Fielding, Tim | |||
| Berners-Lee, Mark Davis, M.T. Carrasco Benitez, James Clark, Tim | Berners-Lee, Mark Davis, M.T. Carrasco Benitez, James Clark, Tim | |||
| Bray, Chris Wendt, Yaron Goland, Andrea Vine, Misha Wolf, Leslie | Bray, Chris Wendt, Yaron Goland, Andrea Vine, Misha Wolf, Leslie | |||
| Daigle, Ted Hardie, Makoto MURATA, Steven Atkin, Ryan Stansifer, Tex | Daigle, Ted Hardie, Makoto MURATA, Steven Atkin, Ryan Stansifer, Tex | |||
| Texin, Graham Klyne, Bjoern Hoehrmann, Chris Lilley, Ian Jacobs, Dan | Texin, Graham Klyne, Bjoern Hoehrmann, Chris Lilley, Ian Jacobs, Adam | |||
| Oscarson, Elliotte Rusty Harold, Mike J. Brown, Simon Josefsson, | Costello, Dan Oscarson, Elliotte Rusty Harold, Mike J. Brown, Andrea | |||
| Carlos Viegas Damasio, and many others for help with understanding | Vine, Roy Badami, Simon Josefsson, Carlos Viegas Damasio, and many | |||
| the issues and possible solutions, and getting the details right. | others for help with understanding the issues and possible solutions, | |||
| Thanks also to the members of the W3C I18N Working Group and Interest | and getting the details right. Thanks also to the members of the W3C | |||
| Group for their contributions and their work on [CharMod], to the | I18N Working Group and Interest Group for their contributions and | |||
| members of many other W3C WGs for adopting the ideas, and to the | their work on [CharMod], to the members of many other W3C WGs for | |||
| members of the Montreal IAB Workshop on Internationalization and | adopting the ideas, and to the members of the Montreal IAB Workshop | |||
| Localization for their review. | on Internationalization and Localization for their review. | |||
| Normative References | Normative References | |||
| [ISO10646] International Organization for Standardization, | [ISO10646] International Organization for Standardization, | |||
| "Information Technology - Universal Multiple-Octet Coded | "Information Technology - Universal Multiple-Octet Coded | |||
| Character Set (UCS) - Part 1: Architecture and Basic | Character Set (UCS) - Part 1: Architecture and Basic | |||
| Multilingual Plane - Part 2: Supplementary Planes", ISO | Multilingual Plane - Part 2: Supplementary Planes", ISO | |||
| Standard 10646, with amendment, July 2002. | Standard 10646, with amendment, July 2002. | |||
| [RFC2234] Crocker, D. and P. Overell, "Augmented BNF for Syntax | [RFC2234] Crocker, D. and P. Overell, "Augmented BNF for Syntax | |||
| skipping to change at page 32, line 14 | skipping to change at page 32, line 46 | |||
| [UTR15] Davis, M. and M. Duerst, "Unicode Normalization Forms", | [UTR15] Davis, M. and M. Duerst, "Unicode Normalization Forms", | |||
| Unicode Standard Annex #15, March 2001, <http:// | Unicode Standard Annex #15, March 2001, <http:// | |||
| www.unicode.org/unicode/reports/tr15/tr15-21.html>. | www.unicode.org/unicode/reports/tr15/tr15-21.html>. | |||
| Non-normative References | Non-normative References | |||
| [BidiEx] "Examples of bidirectional IRIs", <http://www.w3.org/ | [BidiEx] "Examples of bidirectional IRIs", <http://www.w3.org/ | |||
| International/iri-edit/BidiExamples>. | International/iri-edit/BidiExamples>. | |||
| [CharMod] Duerst, M., Yergeau, F., Ishida, R., Wolf, M., | [CharMod] Duerst, M., Yergeau, F., Ishida, R., Wolf, M. and T. | |||
| Freytag, A. and T. Texin, "Character Model for the | Texin, "Character Model for the World Wide Web", | |||
| World Wide Web", World Wide Web Consortium Working | World Wide Web Consortium Working Draft, August 2003, | |||
| Draft, April 2002, <http://www.w3.org/TR/charmod>. | <http://www.w3.org/TR/charmod>. | |||
| [Duerst97] Duerst, M., "The Properties and Promises of UTF-8", | [Duerst97] Duerst, M., "The Properties and Promises of UTF-8", | |||
| Proc. 11th International Unicode Conference, San Jose, | Proc. 11th International Unicode Conference, San Jose, | |||
| September 1997, <http://www.ifi.unizh.ch/mml/ | September 1997, <http://www.ifi.unizh.ch/mml/ | |||
| mduerst/papers/PDF/IUC11-UTF-8.pdf>. | mduerst/papers/PDF/IUC11-UTF-8.pdf>. | |||
| [Duerst01] Duerst, M., "Internationalized Resource Identifiers: | [Duerst01] Duerst, M., "Internationalized Resource Identifiers: | |||
| From Specification to Testing", Proc. 19th | From Specification to Testing", Proc. 19th | |||
| International Unicode Conference, San Jose, | International Unicode Conference, San Jose, | |||
| September 2001, <http://www.w3.org/2001/Talks/0912- | September 2001, <http://www.w3.org/2001/Talks/0912- | |||
| IUC-IRI/paper.html>. | IUC-IRI/paper.html>. | |||
| [Gettys] Gettys, J., "URI Model Consequences", <http:// | ||||
| www.w3.org/DesignIssues/ModelConsequences>. | ||||
| [HTML4] Raggett, D., Le Hors, A. and I. Jacobs, "HTML 4.01 | [HTML4] Raggett, D., Le Hors, A. and I. Jacobs, "HTML 4.01 | |||
| Specification", World Wide Web Consortium | Specification", World Wide Web Consortium | |||
| Recommendation, December 1999, <http://www.w3.org/TR/ | Recommendation, December 1999, <http://www.w3.org/TR/ | |||
| REC-html40/appendix/notes.html#h-B.2>. | REC-html40/appendix/notes.html#h-B.2>. | |||
| [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate | [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate | |||
| Requirement Levels", BCP 14, RFC 2119, March 1997. | Requirement Levels", BCP 14, RFC 2119, March 1997. | |||
| [RFC2130] Weider, C., Preston, C., Simonsen, K., Alvestrand, | [RFC2130] Weider, C., Preston, C., Simonsen, K., Alvestrand, | |||
| H., Atkinson, R., Crispin, M. and P. Svanberg, "The | H., Atkinson, R., Crispin, M. and P. Svanberg, "The | |||
| skipping to change at page 34, line 30 | skipping to change at page 35, line 17 | |||
| Martin Duerst (Note: Please write "Duerst" with u-umlaut wherever | Martin Duerst (Note: Please write "Duerst" with u-umlaut wherever | |||
| possible, for example as "D&#252;rst" in XML and HTML.) | possible, for example as "D&#252;rst" in XML and HTML.) | |||
| World Wide Web Consortium | World Wide Web Consortium | |||
| 200 Technology Square | 200 Technology Square | |||
| Cambridge, MA 02139 | Cambridge, MA 02139 | |||
| U.S.A. | U.S.A. | |||
| Phone: +1 617 253 5509 | Phone: +1 617 253 5509 | |||
| Fax: +1 617 258 5999 | Fax: +1 617 258 5999 | |||
| EMail: duerst@w3.org | EMail: mailto:duerst@w3.org | |||
| URI: http://www.w3.org/People/D%C3%BCrst/ | URI: http://www.w3.org/People/D%C3%BCrst/ | |||
| (Note: This is the escaped form of an IRI.) | (Note: This is the escaped form of an IRI.) | |||
| Michel Suignard | Michel Suignard | |||
| Microsoft Corporation | Microsoft Corporation | |||
| One Microsoft Way | One Microsoft Way | |||
| Redmond, WA 98052 | Redmond, WA 98052 | |||
| U.S.A. | U.S.A. | |||