| draft-duerst-iri-06.txt | draft-duerst-iri-07.txt | |||
|---|---|---|---|---|
| Network Working Group M. Duerst | Network Working Group M. Duerst | |||
| Internet-Draft W3C | Internet-Draft W3C | |||
| Expires: August 15, 2004 M. Suignard | Expires: November 7, 2004 M. Suignard | |||
| Microsoft Corporation | Microsoft Corporation | |||
| February 15, 2004 | May 9, 2004 | |||
| Internationalized Resource Identifiers (IRIs) | Internationalized Resource Identifiers (IRIs) | |||
| draft-duerst-iri-06 | draft-duerst-iri-07 | |||
| Status of this Memo | Status of this Memo | |||
| This document is an Internet-Draft and is in full conformance with | By submitting this Internet-Draft, I certify that any applicable | |||
| all provisions of Section 10 of RFC2026. | patent or other IPR claims of which I am aware have been disclosed, | |||
| | and any of which I become aware will be disclosed, in accordance with | |||
| | RFC 3668. | |||
| Internet-Drafts are working documents of the Internet Engineering | Internet-Drafts are working documents of the Internet Engineering | |||
| Task Force (IETF), its areas, and its working groups. Note that | Task Force (IETF), its areas, and its working groups. Note that other | |||
| other groups may also distribute working documents as Internet- | groups may also distribute working documents as Internet-Drafts. | |||
| Drafts. | ||||
| Internet-Drafts are draft documents valid for a maximum of six months | Internet-Drafts are draft documents valid for a maximum of six months | |||
| and may be updated, replaced, or obsoleted by other documents at any | and may be updated, replaced, or obsoleted by other documents at any | |||
| time. It is inappropriate to use Internet-Drafts as reference | time. It is inappropriate to use Internet-Drafts as reference | |||
| material or to cite them other than as "work in progress." | material or to cite them other than as "work in progress." | |||
| The list of current Internet-Drafts can be accessed at http:// | The list of current Internet-Drafts can be accessed at http:// | |||
| www.ietf.org/ietf/1id-abstracts.txt. | www.ietf.org/ietf/1id-abstracts.txt. | |||
| The list of Internet-Draft Shadow Directories can be accessed at | The list of Internet-Draft Shadow Directories can be accessed at | |||
| http://www.ietf.org/shadow.html. | http://www.ietf.org/shadow.html. | |||
| This Internet-Draft will expire on August 15, 2004. | This Internet-Draft will expire on November 7, 2004. | |||
| Copyright Notice | Copyright Notice | |||
| Copyright (C) The Internet Society (2004). All Rights Reserved. | Copyright (C) The Internet Society (2004). All Rights Reserved. | |||
| Abstract | Abstract | |||
| This document defines a new protocol element, the Internationalized | This document defines a new protocol element, the Internationalized | |||
| Resource Identifier (IRI), as a complement to the URI [RFCYYYY]. An | Resource Identifier (IRI), as a complement to the Uniform Resource | |||
| IRI is a sequence of characters from the Universal Character Set | Identifier (URI). An IRI is a sequence of characters from the | |||
| [ISO10646]. A mapping from IRIs to URIs is defined, which means that | Universal Character Set (Unicode/ISO 10646). A mapping from IRIs to | |||
| IRIs can be used instead of URIs where appropriate to identify | URIs is defined, which means that IRIs can be used instead of URIs | |||
| resources. | where appropriate to identify resources. | |||
| The approach of defining a new protocol element was chosen, instead | The approach of defining a new protocol element was chosen, instead | |||
| of extending or changing the definition of URIs, to allow a clear | of extending or changing the definition of URIs, to allow a clear | |||
| distinction and to avoid incompatibilities with existing software. | distinction and to avoid incompatibilities with existing software. | |||
| Guidelines for the use and deployment of IRIs in various protocols, | Guidelines for the use and deployment of IRIs in various protocols, | |||
| formats, and software components that now deal with URIs are | formats, and software components that now deal with URIs are | |||
| provided. | provided. | |||
| NOTE | Editorial Note | |||
| This document is a product of the Internationalization Working Group | This document is a product of the Internationalization Working Group | |||
| (I18N WG) of the World Wide Web Consortium (W3C). For general | (I18N WG) of the World Wide Web Consortium (W3C). For general | |||
| discussion, please use the public-iri@w3.org mailing list (publicly | discussion, please use the public-iri@w3.org mailing list (publicly | |||
| archived at http://lists.w3.org/Archives/Public/public-iri/). An | archived at http://lists.w3.org/Archives/Public/public-iri/). An | |||
| issues list for this document is maintained at http://www.w3.org/ | issues list for this document is maintained at http://www.w3.org/ | |||
| International/iri-edit#issues. For more information on the topic of | International/iri-edit#issues. For more information on the topic of | |||
| this document, please also see [W3CIRI] and [Duerst01]. | this document, please also see [W3CIRI] and [Duerst01]. | |||
| Table of Contents | Table of Contents | |||
| 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 4 | 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 4 | |||
| 1.1 Overview and Motivation . . . . . . . . . . . . . . . . . . 4 | 1.1 Overview and Motivation . . . . . . . . . . . . . . . . . 4 | |||
| 1.2 Applicability . . . . . . . . . . . . . . . . . . . . . . . 4 | 1.2 Applicability . . . . . . . . . . . . . . . . . . . . . . 4 | |||
| 1.3 Definitions . . . . . . . . . . . . . . . . . . . . . . . . 5 | 1.3 Definitions . . . . . . . . . . . . . . . . . . . . . . . 5 | |||
| 1.4 Notation . . . . . . . . . . . . . . . . . . . . . . . . . . 6 | 1.4 Notation . . . . . . . . . . . . . . . . . . . . . . . . . 6 | |||
| 2. IRI Syntax . . . . . . . . . . . . . . . . . . . . . . . . . 6 | 2. IRI Syntax . . . . . . . . . . . . . . . . . . . . . . . . . . 7 | |||
| 2.1 Summary of IRI Syntax . . . . . . . . . . . . . . . . . . . 7 | 2.1 Summary of IRI Syntax . . . . . . . . . . . . . . . . . . 7 | |||
| 2.2 ABNF for IRI References and IRIs . . . . . . . . . . . . . . 7 | 2.2 ABNF for IRI References and IRIs . . . . . . . . . . . . . 8 | |||
| 3. Relationship between IRIs and URIs . . . . . . . . . . . . . 9 | 3. Relationship between IRIs and URIs . . . . . . . . . . . . . . 10 | |||
| 3.1 Mapping of IRIs to URIs . . . . . . . . . . . . . . . . . . 10 | 3.1 Mapping of IRIs to URIs . . . . . . . . . . . . . . . . . 10 | |||
| 3.2 Converting URIs to IRIs . . . . . . . . . . . . . . . . . . 13 | 3.2 Converting URIs to IRIs . . . . . . . . . . . . . . . . . 13 | |||
| 3.2.1 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . 14 | 3.2.1 Examples . . . . . . . . . . . . . . . . . . . . . . . 15 | |||
| 4. Bidirectional IRIs for Right-to-left Languages . . . . . . . 15 | 4. Bidirectional IRIs for Right-to-left Languages . . . . . . . . 16 | |||
| 4.1 Logical Storage and Visual Presentation . . . . . . . . . . 16 | 4.1 Logical Storage and Visual Presentation . . . . . . . . . 17 | |||
| 4.2 Bidi IRI Structure . . . . . . . . . . . . . . . . . . . . . 17 | 4.2 Bidi IRI Structure . . . . . . . . . . . . . . . . . . . . 18 | |||
| 4.3 Input of Bidi IRIs . . . . . . . . . . . . . . . . . . . . . 18 | 4.3 Input of Bidi IRIs . . . . . . . . . . . . . . . . . . . . 19 | |||
| 4.4 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . 18 | 4.4 Examples . . . . . . . . . . . . . . . . . . . . . . . . . 19 | |||
| 5. IRI Equivalence and Comparison . . . . . . . . . . . . . . . 20 | 5. IRI Equivalence and Comparison . . . . . . . . . . . . . . . . 21 | |||
| 5.1 Simple String Comparison . . . . . . . . . . . . . . . . . . 20 | 5.1 Simple String Comparison . . . . . . . . . . . . . . . . . 21 | |||
| 5.2 Conversion to URIs . . . . . . . . . . . . . . . . . . . . . 21 | 5.2 Conversion to URIs . . . . . . . . . . . . . . . . . . . . 22 | |||
| 5.3 Normalization . . . . . . . . . . . . . . . . . . . . . . . 21 | 5.3 Normalization . . . . . . . . . . . . . . . . . . . . . . 22 | |||
| 5.4 Preferred Forms . . . . . . . . . . . . . . . . . . . . . . 22 | 5.4 Preferred Forms . . . . . . . . . . . . . . . . . . . . . 23 | |||
| 6. Use of IRIs . . . . . . . . . . . . . . . . . . . . . . . . 23 | 6. Use of IRIs . . . . . . . . . . . . . . . . . . . . . . . . . 24 | |||
| 6.1 Limitations on UCS Characters Allowed in IRIs . . . . . . . 23 | 6.1 Limitations on UCS Characters Allowed in IRIs . . . . . . 24 | |||
| 6.2 Software Interfaces and Protocols . . . . . . . . . . . . . 23 | 6.2 Software Interfaces and Protocols . . . . . . . . . . . . 24 | |||
| 6.3 Format of URIs and IRIs in Documents and Protocols . . . . . 23 | 6.3 Format of URIs and IRIs in Documents and Protocols . . . . 25 | |||
| 6.4 Use of UTF-8 for Encoding Original Characters . . . . . . . 24 | 6.4 Use of UTF-8 for Encoding Original Characters . . . . . . 25 | |||
| 6.5 Relative IRI References . . . . . . . . . . . . . . . . . . 25 | 6.5 Relative IRI References . . . . . . . . . . . . . . . . . 26 | |||
| 7. URI/IRI Processing Guidelines (informative) . . . . . . . . 25 | 7. URI/IRI Processing Guidelines (informative) . . . . . . . . . 26 | |||
| 7.1 URI/IRI Software Interfaces . . . . . . . . . . . . . . . . 25 | 7.1 URI/IRI Software Interfaces . . . . . . . . . . . . . . . 26 | |||
| 7.2 URI/IRI Entry . . . . . . . . . . . . . . . . . . . . . . . 26 | 7.2 URI/IRI Entry . . . . . . . . . . . . . . . . . . . . . . 27 | |||
| 7.3 URI/IRI Transfer Between Applications . . . . . . . . . . . 26 | 7.3 URI/IRI Transfer Between Applications . . . . . . . . . . 28 | |||
| 7.4 URI/IRI Generation . . . . . . . . . . . . . . . . . . . . . 27 | 7.4 URI/IRI Generation . . . . . . . . . . . . . . . . . . . . 28 | |||
| 7.5 URI/IRI Selection . . . . . . . . . . . . . . . . . . . . . 27 | 7.5 URI/IRI Selection . . . . . . . . . . . . . . . . . . . . 29 | |||
| 7.6 Display of URIs/IRIs . . . . . . . . . . . . . . . . . . . . 28 | 7.6 Display of URIs/IRIs . . . . . . . . . . . . . . . . . . . 29 | |||
| 7.7 Interpretation of URIs and IRIs . . . . . . . . . . . . . . 28 | 7.7 Interpretation of URIs and IRIs . . . . . . . . . . . . . 30 | |||
| 7.8 Upgrading Strategy . . . . . . . . . . . . . . . . . . . . . 29 | 7.8 Upgrading Strategy . . . . . . . . . . . . . . . . . . . . 30 | |||
| 8. Security Considerations . . . . . . . . . . . . . . . . . . 30 | 8. Security Considerations . . . . . . . . . . . . . . . . . . . 31 | |||
| 9. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . 31 | 9. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . 33 | |||
| Normative References . . . . . . . . . . . . . . . . . . . . 32 | 10. References . . . . . . . . . . . . . . . . . . . . . . . . . 33 | |||
| Non-normative References . . . . . . . . . . . . . . . . . . 32 | 10.1 Normative References . . . . . . . . . . . . . . . . . . . . 33 | |||
| Authors' Addresses . . . . . . . . . . . . . . . . . . . . . 35 | 10.2 Non-normative References . . . . . . . . . . . . . . . . . . 34 | |||
| Full Copyright Statement . . . . . . . . . . . . . . . . . . 36 | Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . 36 | |||
| | A. Design Alternatives . . . . . . . . . . . . . . . . . . . . . 37 | |||
| | A.1 New Scheme(s) . . . . . . . . . . . . . . . . . . . . . . 37 | |||
| | A.2 Other Character Encodings than UTF-8 . . . . . . . . . . . 37 | |||
| | A.3 New Encoding Convention . . . . . . . . . . . . . . . . . 38 | |||
| | A.4 Indicating Character Encodings in the URI/IRI . . . . . . 38 | |||
| | Intellectual Property and Copyright Statements . . . . . . . . 39 | |||
| 1. Introduction | 1. Introduction | |||
| 1.1 Overview and Motivation | 1.1 Overview and Motivation | |||
| A URI is defined in [RFCYYYY] as a sequence of characters chosen from | A URI is defined in [RFCYYYY] as a sequence of characters chosen from | |||
| a limited subset of the repertoire of US-ASCII characters. | a limited subset of the repertoire of US-ASCII characters. | |||
| The characters in URIs are frequently used for representing words of | The characters in URIs are frequently used for representing words of | |||
| natural languages. Such usage has many advantages: such URIs are | natural languages. Such usage has many advantages: such URIs are | |||
| skipping to change at page 4, line 50 | skipping to change at page 4, line 50 | |||
| 1.2 Applicability | 1.2 Applicability | |||
| IRIs are designed to be compatible with recent recommendations for | IRIs are designed to be compatible with recent recommendations for | |||
| new URI schemes [RFC2718]. The compatibility is provided by | new URI schemes [RFC2718]. The compatibility is provided by | |||
| specifying a well defined and deterministic mapping from the IRI | specifying a well defined and deterministic mapping from the IRI | |||
| character sequence to the functionally equivalent URI character | character sequence to the functionally equivalent URI character | |||
| sequence. Practical use of IRIs (or IRI references) in place of URIs | sequence. Practical use of IRIs (or IRI references) in place of URIs | |||
| (or URI references) depends on the following conditions being met: | (or URI references) depends on the following conditions being met: | |||
| a) The protocol or format element used should be explicitly | a) The protocol or format element used should be explicitly | |||
| designated to carry IRIs. That is, the intent is not to | designated to carry IRIs. That is, the intent is not to introduce | |||
| introduce IRIs into contexts that are not defined to accept | IRIs into contexts that are not defined to accept them. For | |||
| them. For example, XML schema [XMLSchema] has an explicit type | example, XML schema [XMLSchema] has an explicit type "anyURI" that | |||
| "anyURI" that designates the use of IRIs. | designates the use of IRIs. | |||
| b) The protocol or format carrying the IRIs should have a | b) The protocol or format carrying the IRIs should have a mechanism | |||
| mechanism to represent the wide range of characters used in | to represent the wide range of characters used in IRIs, either | |||
| IRIs, either natively or by some protocol- or format-specific | natively or by some protocol- or format-specific escaping | |||
| escaping mechanism (for example numeric character references in | mechanism (for example numeric character references in [XML1]). | |||
| [XML1]). | ||||
| c) The URI corresponding to the IRI in question has to encode | c) The URI corresponding to the IRI in question has to encode | |||
| original characters into octets using UTF-8. For new URI | original characters into octets using UTF-8. For new URI schemes, | |||
| schemes, this is recommended in [RFC2718]. It can apply to a | this is recommended in [RFC2718]. It can apply to a whole scheme | |||
| whole scheme (e.g. IMAP URLs [RFC2192] and POP URLs [RFC2384], | (e.g. IMAP URLs [RFC2192] and POP URLs [RFC2384], or the URN | |||
| or the URN syntax [RFC2141]). It can apply to a specific part | syntax [RFC2141]). It can apply to a specific part of a URI, such | |||
| of a URI, such as the fragment identifier (e.g. [XPointer]). | as the fragment identifier (e.g. [XPointer]). It can apply to a | |||
| It can apply to a specific URI or part(s) thereof. For | specific URI or part(s) thereof. For details, please see Section | |||
| details, please see Section 6.4. | 6.4. | |||
| 1.3 Definitions | 1.3 Definitions | |||
| The following definitions are used in this document; they follow the | The following definitions are used in this document; they follow the | |||
| terms in [RFC2130], [RFC2277] and [ISO10646]: | terms in [RFC2130], [RFC2277] and [ISO10646]: | |||
| character: A member of a set of elements used for the | character: A member of a set of elements used for the organization, | |||
| organization, control, or representation of data. For example, | control, or representation of data. For example, "LATIN CAPITAL | |||
| "LATIN CAPITAL LETTER A" names a character. | LETTER A" names a character. | |||
| octet: An ordered sequence of eight bits considered as a unit | octet: An ordered sequence of eight bits considered as a unit | |||
| character repertoire: A set of characters (in the mathematical | character repertoire: A set of characters (in the mathematical sense) | |||
| sense) | ||||
| sequence of characters: A sequence (one after another) of | sequence of characters: A sequence (one after another) of characters | |||
| characters | ||||
| sequence of octets: A sequence (one after another) of octets | sequence of octets: A sequence (one after another) of octets | |||
| (character) encoding: A method of representing a sequence of | character encoding: A method of representing a sequence of characters | |||
| characters as a sequence of octets (maybe with variants). A | as a sequence of octets (maybe with variants). A method of | |||
| method of (unambiguously) converting a sequence of octets into | (unambiguously) converting a sequence of octets into a sequence of | |||
| a sequence of characters. | characters. | |||
| charset: The name of a parameter or attribute used to identify a | charset: The name of a parameter or attribute used to identify a | |||
| character encoding. | character encoding. | |||
| UCS: Universal Character Set; the coded character set defined by | UCS: Universal Character Set; the coded character set defined by ISO/ | |||
| [ISO10646] and [UNIV4]. | IEC 10646 [ISO10646] and the Unicode Standard [UNIV4]. | |||
| IRI reference: The term "IRI reference" denotes the common usage | IRI reference: The term "IRI reference" denotes the common usage of | |||
| of an internationalized resource identifier. An IRI reference | an internationalized resource identifier. An IRI reference may be | |||
| may be absolute or relative. However, the "IRI" that results | absolute or relative. However, the "IRI" that results from such a | |||
| from such a reference only includes absolute IRIs; any relative | reference only includes absolute IRIs; any relative IRIs are | |||
| IRIs are resolved to their absolute form. Note that in | resolved to their absolute form. Note that in [RFC2396], URIs did | |||
| [RFC2396], URIs did not include fragment identifiers, but in | not include fragment identifiers, but in [RFCYYYY], fragment | |||
| [RFCYYYY], fragment identifiers are part of URIs. | identifiers are part of URIs. | |||
| running text: Human text (paragraphs, sentences, phrases) with | running text: Human text (paragraphs, sentences, phrases) with syntax | |||
| syntax according to orthographic conventions of a natural | according to orthographic conventions of a natural language, as | |||
| language, as opposed to syntax defined for ease of processing | opposed to syntax defined for ease of processing by machines | |||
| by machines (markup, programming languages,...). | (markup, programming languages,...). | |||
| | protocol element: Any portion of a message which affects processing | |||
| | of that message by the protocol in question. | |||
| | presentation element: Presentation form corresponding to a protocol | |||
| | element, for example using a wider range of characters. | |||
| | create (an URI or IRI): With respect to URIs and IRIs, the word | |||
| | 'create' is used for the initial creation. This may be the initial | |||
| | creation of a resource with a certain name, or the initial | |||
| | exposition of a resource under a particular name. | |||
| | generate (an URI or IRI): With respect to URIs and IRIs, the word | |||
| | 'generate' is used when the IRI is generated by derivation from | |||
| | other information. | |||
| 1.4 Notation | 1.4 Notation | |||
| RFCs and Internet Drafts currently do not allow any characters | RFCs and Internet Drafts currently do not allow any characters | |||
| outside the US-ASCII repertoire. Therefore, this document uses | outside the US-ASCII repertoire. Therefore, this document uses | |||
| various special notations to denote such characters in examples. | various special notations to denote such characters in examples. | |||
| In text, characters outside US-ASCII are sometimes referenced by | In text, characters outside US-ASCII are sometimes referenced by | |||
| using a prefix of 'U+', followed by four to six hexadecimal digits. | using a prefix of 'U+', followed by four to six hexadecimal digits. | |||
| skipping to change at page 6, line 40 | skipping to change at page 7, line 5 | |||
| XML Notation uses leading '&#x', trailing ';', and the hexadecimal | XML Notation uses leading '&#x', trailing ';', and the hexadecimal | |||
| number of the character in the UCS in between. Example: &#x44F; | number of the character in the UCS in between. Example: &#x44F; | |||
| stands for CYRILLIC CAPITAL LETTER YA. In this notation, an actual | stands for CYRILLIC CAPITAL LETTER YA. In this notation, an actual | |||
| '&' is denoted by '&amp;'. | '&' is denoted by '&amp;'. | |||
| Bidi Notation is used for bidirectional examples: lower case ASCII | Bidi Notation is used for bidirectional examples: lower case ASCII | |||
| letters stand for Latin letters or other letters that are written | letters stand for Latin letters or other letters that are written | |||
| left-to-right, whereas upper case letters represent Arabic or Hebrew | left-to-right, whereas upper case letters represent Arabic or Hebrew | |||
| letters that are written right-to-left. | letters that are written right-to-left. | |||
| To denote actual octets in examples (as opposed to escaped octets), | To denote actual octets in examples (as opposed to percent-encoded | |||
| the two hex digits denoting the octet are enclosed in "<" and ">". | octets), the two hex digits denoting the octet are enclosed in "<" | |||
| For example, the octet often denoted as 0xc9 is denoted here as <c9>. | and ">". For example, the octet often denoted as 0xc9 is denoted here | |||
| | as <c9>. | |||
| The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", | The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", | |||
| "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this | "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this | |||
| document are to be interpreted as described in [RFC2119]. | document are to be interpreted as described in [RFC2119]. | |||
| 2. IRI Syntax | 2. IRI Syntax | |||
| This section defines the syntax of Internationalized Resource | This section defines the syntax of Internationalized Resource | |||
| Identifiers (IRIs). | Identifiers (IRIs). | |||
| skipping to change at page 7, line 31 | skipping to change at page 7, line 45 | |||
| limitations given in the syntax rules below and in Section 6.1. | limitations given in the syntax rules below and in Section 6.1. | |||
| Otherwise, the syntax and use of components and reserved characters | Otherwise, the syntax and use of components and reserved characters | |||
| is the same as that in [RFCYYYY]. All the operations defined in | is the same as that in [RFCYYYY]. All the operations defined in | |||
| [RFCYYYY], such as the resolution of relative URIs, can be applied to | [RFCYYYY], such as the resolution of relative URIs, can be applied to | |||
| IRIs by IRI-processing software in exactly the same way as this is | IRIs by IRI-processing software in exactly the same way as this is | |||
| done to URIs by URI-processing software. | done to URIs by URI-processing software. | |||
| Characters outside the US-ASCII range are not reserved and therefore | Characters outside the US-ASCII range are not reserved and therefore | |||
| MUST NOT be used for syntactical purposes such as to delimit | MUST NOT be used for syntactical purposes such as to delimit | |||
| components in newly defined schemes. As an example, it is not | components in newly defined schemes. As an example, it is not allowed | |||
| allowed to use U+00A2, CENT SIGN, as a delimiter in IRIs, because it | to use U+00A2, CENT SIGN, as a delimiter in IRIs, because it is in | |||
| is in the 'iunreserved' category, in the same way as it is not | the 'iunreserved' category, in the same way as it is not possible to | |||
| possible to use '-' as a delimiter, because it is in the 'unreserved' | use '-' as a delimiter, because it is in the 'unreserved' category in | |||
| category in URIs. | URIs. | |||
| 2.2 ABNF for IRI References and IRIs | 2.2 ABNF for IRI References and IRIs | |||
| While it might be possible to define IRI references and IRIs merely | While it might be possible to define IRI references and IRIs merely | |||
| by their transformation to URI references and URIs, they can also be | by their transformation to URI references and URIs, they can also be | |||
| accepted and processed directly. Therefore, an ABNF definition for | accepted and processed directly. Therefore, an ABNF definition for | |||
| IRI references (which are the most general concept and the start of | IRI references (which are the most general concept and the start of | |||
| the grammar) and IRIs is given here. The syntax of this ABNF is | the grammar) and IRIs is given here. The syntax of this ABNF is | |||
| described in [RFC2234]. Character numbers are taken from the UCS, | described in [RFC2234]. Character numbers are taken from the UCS, | |||
| without implying any actual binary encoding. Terminals in the ABNF | without implying any actual binary encoding. Terminals in the ABNF | |||
| are characters, not bytes. | are characters, not bytes. | |||
| The following rules are different from [RFCYYYY]: | The following rules are different from [RFCYYYY]: | |||
| IRI = scheme ":" ["//" iauthority] ipath ["?" iquery] | IRI = scheme ":" ihier-part [ "?" iquery ] | |||
| ["#" ifragment] | ["#" ifragment] | |||
| IRI-reference = IRI / relative-IRI | ||||
| relative-IRI = ["//" iauthority] ipath ["?" iquery] | ihier-part = "//" iauthority ipath-abempty | |||
| ["#" ifragment] | / ipath-abs | |||
| | / ipath-rootless | |||
| | / ipath-empty | |||
| absolute-IRI = scheme ":" ["//" iauthority] ipath ["?" iquery] | IRI-reference = IRI / relative-IRI | |||
| iauthority = [ iuserinfo "@" ] ihost [ ":" port ] | absolute-IRI = scheme ":" ihier-part [ "?" iquery ] | |||
| iuserinfo = *( iunreserved / pct-encoded / sub-delims | relative-IRI = irelative-part [ "?" iquery ] [ "#" ifragment ] | |||
| / ":" ) | ||||
| | irelative-part = "//" iauthority ipath-abempty | |||
| | / ipath-abs | |||
| | / ipath-noscheme | |||
| | / ipath-empty | |||
| | iauthority = [ iuserinfo "@" ] ihost [ ":" port ] | |||
| | iuserinfo = *( iunreserved / pct-encoded / sub-delims / ":" ) | |||
| ihost = IP-literal / IPv4address / ireg-name | ihost = IP-literal / IPv4address / ireg-name | |||
| ireg-name = 0*255( iunreserved / pct-encoded / sub-delims ) | ireg-name = 0*255( iunreserved / pct-encoded / sub-delims ) | |||
| ipath = isegment *( "/" isegment ) | ipath = ipath-abempty ; begins with "/" or is empty | |||
| | / ipath-abs ; begins with "/" but not "//" | |||
| | / ipath-noscheme ; begins with a non-colon segment | |||
| | / ipath-rootless ; begins with a segment | |||
| | / ipath-empty ; zero characters | |||
| | ipath-abempty = *( "/" isegment ) | |||
| | ipath-abs = "/" [ isegment-nz *( "/" isegment ) ] | |||
| | ipath-noscheme = isegment-nzc *( "/" isegment ) | |||
| | ipath-rootless = isegment-nz *( "/" isegment ) | |||
| | ipath-empty = 0<ipchar> | |||
| isegment = *ipchar | isegment = *ipchar | |||
| | isegment-nz = 1*ipchar | |||
| | isegment-nzc = 1*( iunreserved / pct-encoded / sub-delims | |||
| | / "@" ) | |||
| | ipchar = iunreserved / pct-encoded / sub-delims / ":" | |||
| | / "@" | |||
| iquery = *( ipchar / iprivate / "/" / "?" ) | iquery = *( ipchar / iprivate / "/" / "?" ) | |||
| ifragment = *( ipchar / "/" / "?" ) | ifragment = *( ipchar / "/" / "?" ) | |||
| ipchar = iunreserved / pct-encoded / sub-delims / ":" | ||||
| / "@" | ||||
| iunreserved = ALPHA / DIGIT / "-" / "." / "_" / "~" / ucschar | iunreserved = ALPHA / DIGIT / "-" / "." / "_" / "~" / ucschar | |||
| ucschar = %xA0-D7FF / %xF900-FDCF / %xFDF0-FFEF / | ucschar = %xA0-D7FF / %xF900-FDCF / %xFDF0-FFEF | |||
| / %x10000-1FFFD / %x20000-2FFFD / %x30000-3FFFD | / %x10000-1FFFD / %x20000-2FFFD / %x30000-3FFFD | |||
| / %x40000-4FFFD / %x50000-5FFFD / %x60000-6FFFD | / %x40000-4FFFD / %x50000-5FFFD / %x60000-6FFFD | |||
| / %x70000-7FFFD / %x80000-8FFFD / %x90000-9FFFD | / %x70000-7FFFD / %x80000-8FFFD / %x90000-9FFFD | |||
| / %xA0000-AFFFD / %xB0000-BFFFD / %xC0000-CFFFD | / %xA0000-AFFFD / %xB0000-BFFFD / %xC0000-CFFFD | |||
| / %xD0000-DFFFD / %xE1000-EFFFD | / %xD0000-DFFFD / %xE1000-EFFFD | |||
| iprivate = %xE000-F8FF / %xF0000-FFFFD / %x100000-10FFFD | iprivate = %xE000-F8FF / %xF0000-FFFFD / %x100000-10FFFD | |||
| Some productions ambiguous. The "first-match-wins" (a.k.a. | Some productions are ambiguous. The "first-match-wins" (a.k.a. | |||
| "greedy") algorithm applies. For details, see [RFCYYYY]. | "greedy") algorithm applies. For details, see [RFCYYYY]. | |||
| The following are the same as [RFCYYYY]: | The following are the same as in [RFCYYYY]: | |||
| scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." ) | scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." ) | |||
| port = *DIGIT | port = *DIGIT | |||
| IP-literal = "[" ( IPv6address \| IPvFuture ) "]" | IP-literal = "[" ( IPv6address / IPvFuture ) "]" | |||
| IPvFuture = "v" HEXDIG "." 1*( unreserved / sub-delims / ":" ) | ||||
| IPv6address = 6( h4 ":" ) ls32 | ||||
| / "::" 5( h4 ":" ) ls32 | ||||
| / [ h4 ] "::" 4( h4 ":" ) ls32 | ||||
| / [ *1( h4 ":" ) h4 ] "::" 3( h4 ":" ) ls32 | ||||
| / [ *2( h4 ":" ) h4 ] "::" 2( h4 ":" ) ls32 | ||||
| / [ *3( h4 ":" ) h4 ] "::" h4 ":" ls32 | ||||
| / [ *4( h4 ":" ) h4 ] "::" ls32 | ||||
| / [ *5( h4 ":" ) h4 ] "::" h4 | ||||
| / [ *6( h4 ":" ) h4 ] "::" | ||||
| h4 = 1*4HEXDIG | IPvFuture = "v" 1*HEXDIG "." 1*( unreserved / sub-delims | |||
| | / ":" ) | |||
| ls32 = ( h4 ":" h4 ) / IPv4address | IPv6address = 6( h16 ":" ) ls32 | |||
| | / "::" 5( h16 ":" ) ls32 | |||
| | / [ h16 ] "::" 4( h16 ":" ) ls32 | |||
| | / [ *1( h16 ":" ) h16 ] "::" 3( h16 ":" ) ls32 | |||
| | / [ *2( h16 ":" ) h16 ] "::" 2( h16 ":" ) ls32 | |||
| | / [ *3( h16 ":" ) h16 ] "::" h16 ":" ls32 | |||
| | / [ *4( h16 ":" ) h16 ] "::" ls32 | |||
| | / [ *5( h16 ":" ) h16 ] "::" h16 | |||
| | / [ *6( h16 ":" ) h16 ] "::" | |||
| | h16 = 1*4HEXDIG | |||
| | ls32 = ( h16 ":" h16 ) / IPv4address | |||
| IPv4address = dec-octet "." dec-octet "." dec-octet "." dec-octet | IPv4address = dec-octet "." dec-octet "." dec-octet | |||
| | "." dec-octet | |||
| dec-octet = DIGIT ; 0-9 | dec-octet = DIGIT ; 0-9 | |||
| / %x31-39 DIGIT ; 10-99 | / %x31-39 DIGIT ; 10-99 | |||
| / "1" 2DIGIT ; 100-199 | / "1" 2DIGIT ; 100-199 | |||
| / "2" %x30-34 DIGIT ; 200-249 | / "2" %x30-34 DIGIT ; 200-249 | |||
| / "25" %x30-35 ; 250-255 | / "25" %x30-35 ; 250-255 | |||
| pct-encoded = "%" HEXDIG HEXDIG | pct-encoded = "%" HEXDIG HEXDIG | |||
| unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~" | unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~" | |||
| skipping to change at page 9, line 48 | skipping to change at page 10, line 33 | |||
| 3. Relationship between IRIs and URIs | 3. Relationship between IRIs and URIs | |||
| IRIs are meant to replace URIs in identifying resources for | IRIs are meant to replace URIs in identifying resources for | |||
| protocols, formats and software components which use a UCS-based | protocols, formats and software components which use a UCS-based | |||
| character repertoire. These protocols and components may never need | character repertoire. These protocols and components may never need | |||
| to use URIs directly, especially when the resource identifier is used | to use URIs directly, especially when the resource identifier is used | |||
| simply for identification purposes. However, when the resource | simply for identification purposes. However, when the resource | |||
| identifier is used for resource retrieval, it is in many cases | identifier is used for resource retrieval, it is in many cases | |||
| necessary to determine the associated URI because most retrieval | necessary to determine the associated URI because most retrieval | |||
| mechanisms are currently defined only for URIs. (Additional | mechanisms are currently defined only for URIs. In this case, IRIs | |||
| can serve as presentation elements for URI protocol elements. An | ||||
| example would be an address bar in a Web user agent. (Additional | ||||
| rationale is given in Section 3.1.) | rationale is given in Section 3.1.) | |||
| 3.1 Mapping of IRIs to URIs | 3.1 Mapping of IRIs to URIs | |||
| This section defines how to map an IRI to a URI. Everything in this | This section defines how to map an IRI to a URI. Everything in this | |||
| section applies also to IRI references and URI references, as well as | section applies also to IRI references and URI references, as well as | |||
| components thereof (for example fragment identifiers). | components thereof (for example fragment identifiers). | |||
| This mapping has two purposes: | This mapping has two purposes: | |||
| a) Syntactical: Many URI schemes and components define additional | a) Syntactical: Many URI schemes and components define additional | |||
| syntactical restrictions not captured in Section 2.2. Such | syntactical restrictions not captured in Section 2.2. | |||
| restrictions can be applied to IRIs by noting that IRIs are | Scheme-specific restrictions are applied to IRIs by converting | |||
| only valid if they map to syntactically valid URIs. This means | IRIs to URIs and checking the URIs against the scheme-specific | |||
| that such syntactical restrictions do not have to be defined | restrictions. | |||
| again on the IRI level. | ||||
| b) Interpretational: URIs identify resources in various ways. | b) Interpretational: URIs identify resources in various ways. IRIs | |||
| IRIs also identify resources. When the IRI is used solely for | also identify resources. When the IRI is used solely for | |||
| identification purposes, it is not necessary to map the IRI to | identification purposes, it is not necessary to map the IRI to a | |||
| a URI (see Section 5). However, when an IRI is used for | URI (see Section 5). However, when an IRI is used for resource | |||
| resource retrieval, the resource that the IRI locates is the | retrieval, the resource that the IRI locates is the same as the | |||
| same as the one located by the URI obtained after converting | one located by the URI obtained after converting the IRI according | |||
| the IRI according to the procedure defined here. This means | to the procedure defined here. This means that there is no need to | |||
| that there is no need to define resolution separately on the | define resolution separately on the IRI level. | |||
| IRI level. | ||||
| Applications MUST map IRIs to URIs using the following two steps. | Applications MUST map IRIs to URIs using the following two steps. | |||
| Step 1) This step generates a UCS-based encoding from the original | Step 1) This step generates a UCS-based character encoding from the | |||
| IRI format. This step has three variants, depending on the | original IRI format. This step has three variants, depending on | |||
| form of the input. | the form of the input. | |||
| Variant A) If the IRI is written on paper or read out loud, | Variant A) If the IRI is written on paper or read out loud, or | |||
| or otherwise represented as a sequence of characters | otherwise represented as a sequence of characters independent | |||
| independent of any encoding: Represent the IRI as a | of any character encoding: Represent the IRI as a sequence of | |||
| sequence of characters from the UCS normalized according | characters from the UCS normalized according to Normalization | |||
| to Normalization Form C (NFC, [UTR15]). | Form C (NFC, [UTR15]). | |||
| Variant B) If the IRI is in some digital representation | Variant B) If the IRI is in some digital representation (e.g. an | |||
| (e.g. an octet stream) in some known non-Unicode | octet stream) in some known non-Unicode character encoding: | |||
| encoding: Convert the IRI to a sequence of characters | Convert the IRI to a sequence of characters from the UCS | |||
| from the UCS normalized according to NFC. | normalized according to NFC. | |||
| Variant C) If the IRI is in a Unicode-based encoding (for | Variant C) If the IRI is in a Unicode-based character encoding | |||
| example UTF-8 or UTF-16): Do not normalize. Move | (for example UTF-8 or UTF-16): Do not normalize. Move directly | |||
| directly to Step 2. | to Step 2. | |||
| Step 2) For each character that is disallowed in URI references, | Step 2) For each character that is disallowed in URI references, | |||
| apply steps 1) through 3) below. The disallowed characters | apply Steps 2.1 through 2.3 below. The disallowed characters | |||
| consist of all non-ASCII characters allowed in IRIs. | consist of all non-ASCII characters allowed in IRIs. | |||
| 1) Convert the character to a sequence of one or more octets | 2.1) Convert the character to a sequence of one or more octets | |||
| using UTF-8 [RFC3629]. | using UTF-8 [RFC3629]. | |||
| 2) Convert each octet to %HH, where HH is the hexadecimal | 2.2) Convert each octet to %HH, where HH is the hexadecimal | |||
| notation of the octet value. Note: This is identical to | notation of the octet value. Note: This is identical to the | |||
| the escaping mechanism in Section 2.4.1 of [RFCYYYY]. To | percent-encoding mechanism in Section 2.1 of [RFCYYYY]. To | |||
| reduce variability, the hexadecimal notation SHOULD use | reduce variability, the hexadecimal notation SHOULD use upper | |||
| upper case letters. | case letters. | |||
| 3) Replace the original character by the resulting character | 2.3) Replace the original character by the resulting character | |||
| sequence (i.e. a sequence of %HH triplets). | sequence (i.e. a sequence of %HH triplets). | |||
| The above mapping from IRIs to URIs produces URIs fully conforming to | The above mapping from IRIs to URIs produces URIs fully conforming to | |||
| [RFCYYYY]. The mapping is also an identity transformation for URIs | [RFCYYYY]. The mapping is also an identity transformation for URIs | |||
| and is idempotent -- applying the mapping a second time will not | and is idempotent -- applying the mapping a second time will not | |||
| change anything. Every URI is by definition an IRI. | change anything. Every URI is by definition an IRI. | |||
| Infrastructure accepting IRIs MAY also convert the ireg-name | Infrastructure accepting IRIs MAY convert the ireg-name component of | |||
| component of an IRI as follows (before step 2 above) if it knows that | an IRI as follows (before Step 2.2 above) for schemes that are known | |||
| the scheme in question uses domain names: Replace the iregname part | to use domain names in ireg-name, but where the scheme definition | |||
| of the IRI by the part converted using the ToASCII operation | does not allow percent-encoding for ireg-name: Replace the ireg-name | |||
| specified in Section 4.1 of [RFC3490], with the flag | part of the IRI by the part converted using the ToASCII operation | |||
| specified in Section 4.1 of [RFC3490] on each dot-separated label, | ||||
| and using U+002E (FULL STOP) as a label separator, with the flag | ||||
| UseSTD3ASCIIRules set to TRUE and the flag AllowUnassigned set to | UseSTD3ASCIIRules set to TRUE and the flag AllowUnassigned set to | |||
| FALSE for creating IRIs and set to TRUE otherwise. The ToASCII | FALSE for creating IRIs and set to TRUE otherwise. The ToASCII | |||
| operation may fail, but this would mean that the IRI cannot be | operation may fail, but this would mean that the IRI cannot be | |||
| resolved. For example, the IRI | resolved. This conversion SHOULD be used when the goal is to maximize | |||
| interoperability with legacy URI resolvers. For example, the IRI | ||||
| http://résumé.example.org may be converted to | http://résumé.example.org may be converted to | |||
| http://xn--rsum-bpad.example.org instead of | http://xn--rsum-bpad.example.org instead of | |||
| http://r%C3%A9sum%C3%A9.example.org. | http://r%C3%A9sum%C3%A9.example.org. | |||
| Note: The uniform treatment of the whole IRI in step 2) above is | An IRI with a scheme that is known to use domain names in ireg-name, | |||
| but where the scheme definition does not allow percent-encoding for | ||||
| ireg-name, meets scheme-specific restrictions if either the | ||||
| straightforward conversion or the conversion using the ToASCII | ||||
| operation on ireg-name results in a URI that meets the | ||||
| scheme-specific restrictions. An IRI with a scheme that is known to | ||||
| use domain names in ireg-name, but where the scheme definition does | ||||
| not allow percent-encoding for ireg-name, resolves to the URI | ||||
| obtained after converting the IRI including using the ToASCII | ||||
| operation on ireg-name. Implementations do not need to do this | ||||
| conversion as long as they produce the same result. | ||||
| Note: The uniform treatment of the whole IRI in Step 2.2 above is | ||||
| important to not make processing dependent on URI scheme. See | important to not make processing dependent on URI scheme. See | |||
| [Gettys] for an in-depth discussion. | [Gettys] for an in-depth discussion. | |||
| Note: In practice, the difference above will not be noticed if | Note: In practice, the difference above will not be noticed if | |||
| mapping from IRI to URI and resolution is tightly integrated | mapping from IRI to URI and resolution is tightly integrated (e.g. | |||
| (e.g. carried out in the same user agent). But conversion | carried out in the same user agent). But conversion using | |||
| using [RFC3490] may be able to better deal with backwards | [RFC3490] may be able to better deal with backwards compatibility | |||
| compatibility issues in case mapping and resolution are | issues in case mapping and resolution are separated, as in the | |||
| separated, as in the case of using an HTTP proxy. | case of using an HTTP proxy. | |||
| Note: Internationalized Domain Names may be contained in parts of | Note: Internationalized Domain Names may be contained in parts of an | |||
| an IRI other than the ireg-name part. It is the responsibility | IRI other than the ireg-name part. It is the responsibility of | |||
| of scheme-specific implementations (if the Internationalized | scheme-specific implementations (if the Internationalized Domain | |||
| Domain Name is part of the scheme syntax) or of server-side | Name is part of the scheme syntax) or of server-side | |||
| implementations (if the Internationalized Domain Name is part | implementations (if the Internationalized Domain Name is part of | |||
| of 'iquery') to apply the necessary conversions at the | 'iquery') to apply the necessary conversions at the appropriate | |||
| appropriate point. Example: Trying to validate the Web page at | point. Example: Trying to validate the Web page at | |||
| http://résumé.example.org would lead to an IRI of | http://résumé.example.org would lead to an IRI of | |||
| http://validator.w3.org/ | http://validator.w3.org/ | |||
| check?uri=http%3A%2F%2Frésumé.example.org, which | check?uri=http%3A%2F%2Frésumé.example.org, which would | |||
| would convert to a URI of | convert to a URI of | |||
| http://validator.w3.org/ | http://validator.w3.org/ | |||
| check?uri=http%3A%2F%2Fr%C3%A9sum%C3%A9.example.org. The | check?uri=http%3A%2F%2Fr%C3%A9sum%C3%A9.example.org. The server | |||
| server side implementation would be responsible to do the | side implementation would be responsible to do the necessary | |||
| necessary conversions in order to be able to retrieve the Web | conversions in order to be able to retrieve the Web page. | |||
| page. | ||||
| Infrastructure accepting IRIs MAY also deal with the printable | Infrastructure accepting IRIs MAY also deal with the printable | |||
| characters in US-ASCII that are not allowed in URIs, namely "<", ">", | characters in US-ASCII that are not allowed in URIs, namely "<", ">", | |||
| '"', Space, "{", "}", "|", "\", "^", and "`", in step 2) above. If | '"', Space, "{", "}", "|", "\", "^", and "`", in Step 2.2 above. If | |||
| such characters are found but are not converted, then the conversion | such characters are found but are not converted, then the conversion | |||
| SHOULD fail. Please note that the number sign ("#"), the percent | SHOULD fail. Please note that the number sign ("#"), the percent sign | |||
| sign ("%"), and the square bracket characters ("[", "]") are not part | ("%"), and the square bracket characters ("[", "]") are not part of | |||
| of the above list, and MUST NOT be converted. Protocols and formats | the above list, and MUST NOT be converted. Protocols and formats that | |||
| that have used earlier definitions of IRIs including these characters | have used earlier definitions of IRIs including these characters MAY | |||
| MAY require unescaping of these characters as a preprocessing step to | require percent-encoding of these characters as a preprocessing step | |||
| extract the actual IRI from a given field. Such preprocessing MAY | to extract the actual IRI from a given field. Such preprocessing MAY | |||
| also be used by applications allowing the user to enter an IRI. | also be used by applications allowing the user to enter an IRI. | |||
| Note: In this process (in step 2.3), characters allowed in URI | Note: In this process (in Step 2.3), characters allowed in URI | |||
| references as well as existing escape sequences are not escaped | references as well as existing percent-encoded sequences are not | |||
| further. (This mapping is similar to, but different from, the | encoded further. (This mapping is similar to, but different from, | |||
| escaping applied when including arbitrary content into some | the encoding applied when including arbitrary content into some | |||
| part of a URI.) For example, an IRI of | part of a URI.) For example, an IRI of | |||
| http://www.example.org/red%09rosé#red (in XML notation) is | http://www.example.org/red%09rosé#red (in XML notation) is | |||
| converted to | converted to | |||
| http://www.example.org/red%09ros%C3%A9#red, not to something | http://www.example.org/red%09ros%C3%A9#red, not to something like | |||
| like | ||||
| http%3A%2F%2Fwww.example.org%2Fred%2509ros%C3%A9%23red. | http%3A%2F%2Fwww.example.org%2Fred%2509ros%C3%A9%23red. | |||
| Note: Some older software transcoding to UTF-8 may produce illegal | Note: Some older software transcoding to UTF-8 may produce illegal | |||
| output for some input, in particular for characters outside the | output for some input, in particular for characters outside the | |||
| BMP (Basic Multilingual Plane). As an example, for the | BMP (Basic Multilingual Plane). As an example, for the following | |||
| following IRI with non-BMP characters (in XML Notation): | IRI with non-BMP characters (in XML Notation): | |||
| http://example.com/𐌀𐌁𐌂 | http://example.com/𐌀𐌁𐌂 | |||
| (the first three letters of the Old Italic alphabet) the | (the first three letters of the Old Italic alphabet) the correct | |||
| correct conversion to a URI is: | conversion to a URI is: | |||
| http://example.com/%F0%90%8C%80%F0%90%8C%81%F0%90%8C%82 | http://example.com/%F0%90%8C%80%F0%90%8C%81%F0%90%8C%82 | |||
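The following is an illustrative sketch of Step 2 of the mapping in Section 3.1, not text from either draft. It assumes the IRI is already held as a Python string (Variant C, so no normalization is applied). The function name iri_to_uri is hypothetical. The sketch does not apply the optional ToASCII conversion of ireg-name, and it does not handle the US-ASCII characters that are not allowed in URIs.

```python
def iri_to_uri(iri: str) -> str:
    """Percent-encode every non-ASCII character of an IRI (Steps 2.1 to 2.3)."""
    pieces = []
    for ch in iri:
        if ord(ch) > 0x7F:
            # Step 2.1: convert the character to UTF-8 octets.
            octets = ch.encode("utf-8")
            # Step 2.2: write each octet as %HH with upper case hex digits.
            # Step 2.3: the triplets replace the original character.
            pieces.append("".join(f"%{octet:02X}" for octet in octets))
        else:
            # ASCII characters, including existing %HH triplets, "#", "%",
            # "[" and "]", are copied through unchanged.
            pieces.append(ch)
    return "".join(pieces)


if __name__ == "__main__":
    print(iri_to_uri("http://www.example.org/red%09ros\u00e9#red"))
    print(iri_to_uri("http://example.com/\U00010300\U00010301\U00010302"))
```

Because only non-ASCII characters are touched, applying the function to its own output changes nothing, which matches the idempotence property stated above.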
| 3.2 Converting URIs to IRIs | 3.2 Converting URIs to IRIs | |||
| In some situations, it may be desirable to try to convert a URI into | In some situations, it may be desirable to try to convert a URI into | |||
| an equivalent IRI. This section gives a procedure to do such a | an equivalent IRI. This section gives a procedure to do such a | |||
| conversion. The conversion described in this section will always | conversion. The conversion described in this section will always | |||
| result in an IRI which maps back to the URI that was used as an input | result in an IRI which maps back to the URI that was used as an input | |||
| for the conversion (except for potential case differences in escape | for the conversion (except for potential case differences in | |||
| sequences). However, the IRI resulting from this conversion may not | percent-encoding). However, the IRI resulting from this conversion | |||
| be exactly the same as the original IRI (if there ever was one). | may not be exactly the same as the original IRI (if there ever was | |||
| one). | ||||
| URI to IRI conversion removes escape sequences, but not all escaping | URI to IRI conversion removes percent-encodings, but not all | |||
| can be eliminated. There are several reasons for this: | percent-encodings can be eliminated. There are several reasons for | |||
| this: | ||||
| a) Some escape sequences are necessary to distinguish escaped and | a) Some percent-encodings are necessary to distinguish | |||
| unescaped uses of reserved characters. | percent-encoded and unencoded uses of reserved characters. | |||
| b) Some escape sequences cannot be interpreted as sequences of | b) Some percent-encodings cannot be interpreted as sequences of UTF-8 | |||
| UTF-8 octets. | octets. | |||
| (Note: The octet patterns of UTF-8 are highly regular. | (Note: The octet patterns of UTF-8 are highly regular. Therefore, | |||
| Therefore, there is a very high probability, but no guarantee, | there is a very high probability, but no guarantee, that | |||
| that escape sequences that can be interpreted as sequences of | percent-encodings that can be interpreted as sequences of UTF-8 | |||
| UTF-8 octets actually originated from UTF-8. For a detailed | octets actually originated from UTF-8. For a detailed discussion, | |||
| discussion, see [Duerst97].) | see [Duerst97].) | |||
| c) The conversion may result in a character that is not | c) The conversion may result in a character that is not appropriate | |||
| appropriate in an IRI. See Section 6.1 for further details. | in an IRI. See Section 6.1 for further details. | |||
| Conversion from a URI to an IRI is done using the following steps (or | Conversion from a URI to an IRI is done using the following steps (or | |||
| any other algorithm that produces the same result): | any other algorithm that produces the same result): | |||
| 1) Represent the URI as a sequence of octets in US-ASCII. | 1) Represent the URI as a sequence of octets in US-ASCII. | |||
| 2) Convert all hexadecimal escapes (% followed by two hexadecimal | 2) Convert all percent-encodings (% followed by two hexadecimal | |||
| digits) except those corresponding to '%', characters in | digits) except those corresponding to '%', characters in | |||
| 'reserved', and characters in US-ASCII not allowed in URIs, to | 'reserved', and characters in US-ASCII not allowed in URIs, to the | |||
| the corresponding octets. | corresponding octets. | |||
| 3) Re-escape any octet produced in step 2) that is not part of a | 3) Re-percent-encode any octet produced in Step 2 that is not part of | |||
| strictly legal UTF-8 octet sequence. | a strictly legal UTF-8 octet sequence. | |||
| 4) Re-escape all octets produced in step 3) that in UTF-8 | 4) Re-percent-encode all octets produced in Step 3 that in UTF-8 | |||
| represent characters that are not appropriate according to | represent characters that are not appropriate according to Section | |||
| Section 4.1 and Section 6.1. | 4.1 and Section 6.1. | |||
| 5) Interpret the resulting octet sequence as a sequence of | 5) Interpret the resulting octet sequence as a sequence of characters | |||
| characters encoded in UTF-8. | encoded in UTF-8. | |||
| This procedure will convert as many escaped non-ASCII characters as | This procedure will convert as many percent-encoded non-ASCII | |||
| possible to characters in an IRI. Because there are some choices | characters as possible to characters in an IRI. Because there are | |||
| when applying step 4) (see Section 6.1), results may vary. | some choices when applying Step 4 (see Section 6.1), results may | |||
| vary. | ||||
| Conversions from URIs to IRIs MUST NOT use any other encoding than | Conversions from URIs to IRIs MUST NOT use any other character | |||
| UTF-8 in steps 3) and 4) above, even if it might be possible from | encoding than UTF-8 in Steps 3 and 4 above, even if it might be | |||
| context to guess that another encoding than UTF-8 was used in the | possible from context to guess that another character encoding than | |||
| URI. As an example, the URI http://www.example.org/r%E9sum%E9.html | UTF-8 was used in the URI. As an example, the URI | |||
| might with some guessing be interpreted to contain two e-acute | http://www.example.org/r%E9sum%E9.html might with some guessing be | |||
| characters encoded as iso-8859-1. It must not be converted to an IRI | interpreted to contain two e-acute characters encoded as iso-8859-1. | |||
| containing these e-acute characters. Otherwise, the IRI will in the | It must not be converted to an IRI containing these e-acute | |||
| future be mapped to http://www.example.org/r%C3%A9sum%C3%A9.html, | characters. Otherwise, the IRI will in the future be mapped to | |||
| which is a different URI than http://www.example.org/r%E9sum%E9.html. | http://www.example.org/r%C3%A9sum%C3%A9.html, which is a different URI than | |||
| http://www.example.org/r%E9sum%E9.html. | ||||
| 3.2.1 Examples | 3.2.1 Examples | |||
| This section shows various examples of converting URIs to IRIs. The | This section shows various examples of converting URIs to IRIs. The | |||
| notation <hh> is used to denote octets outside those that can be | notation <hh> is used to denote octets outside those that can be | |||
| represented in this document. Each example shows the result after | represented in this document. Each example shows the result after | |||
| applying each of the steps 1) to 5). XML Notation is used for the | applying each of the Steps 1 to 5. XML Notation is used for the final | |||
| final result. | result. | |||
| The following example contains the sequence '%C3%BC', which is a | The following example contains the sequence '%C3%BC', which is a | |||
| strictly legal UTF-8 sequence, and which is converted into the actual | strictly legal UTF-8 sequence, and which is converted into the actual | |||
| character U+00FC LATIN SMALL LETTER U WITH DIAERESIS (also known as | character U+00FC LATIN SMALL LETTER U WITH DIAERESIS (also known as | |||
| u-umlaut). | u-umlaut). | |||
| 1) http://www.example.org/D%C3%BCrst | 1) http://www.example.org/D%C3%BCrst | |||
| 2) http://www.example.org/D<c3><bc>rst | 2) http://www.example.org/D<c3><bc>rst | |||
| 3) http://www.example.org/D<c3><bc>rst | 3) http://www.example.org/D<c3><bc>rst | |||
| 4) http://www.example.org/D<c3><bc>rst | 4) http://www.example.org/D<c3><bc>rst | |||
| 5) http://www.example.org/Dürst | 5) http://www.example.org/Dürst | |||
| The following example contains the sequence '%FC', which might | The following example contains the sequence '%FC', which might | |||
| represent U+00FC LATIN SMALL LETTER U WITH DIAERESIS in the | represent U+00FC LATIN SMALL LETTER U WITH DIAERESIS in the | |||
| iso-8859-1 encoding. (It might represent other characters in other | iso-8859-1 character encoding. (It might represent other characters | |||
| encodings. For example, the octet <fc> in iso-8859-5 represents | in other character encodings. For example, the octet <fc> in | |||
| U+045C CYRILLIC SMALL LETTER KJE.) Because <fc> is not part of a | iso-8859-5 represents U+045C CYRILLIC SMALL LETTER KJE.) Because <fc> | |||
| strictly legal UTF-8 sequence, it is re-escaped in step 3). | is not part of a strictly legal UTF-8 sequence, it is | |||
| re-percent-encoded in Step 3. | ||||
| 1) http://www.example.org/D%FCrst | 1) http://www.example.org/D%FCrst | |||
| 2) http://www.example.org/D<fc>rst | 2) http://www.example.org/D<fc>rst | |||
| 3) http://www.example.org/D%FCrst | 3) http://www.example.org/D%FCrst | |||
| 4) http://www.example.org/D%FCrst | 4) http://www.example.org/D%FCrst | |||
| 5) http://www.example.org/D%FCrst | 5) http://www.example.org/D%FCrst | |||
| The following example contains '%e2%80%ae', which is the escaped | The following example contains '%e2%80%ae', which is the | |||
| UTF-8 encoding of U+202E, RIGHT-TO-LEFT OVERRIDE. Section 4.1 | percent-encoded | |||
| forbids the direct use of this character in an IRI. Therefore, the | UTF-8 character encoding of U+202E, RIGHT-TO-LEFT OVERRIDE. Section | |||
| corresponding octets are re-escaped in step 4). This example shows | 4.1 forbids the direct use of this character in an IRI. Therefore, | |||
| that the case (upper or lower) of letters used in escapes may not be | the corresponding octets are re-percent-encoded in Step 4. This | |||
| preserved. The example also contains a punycode-encoded domain name | example shows that the case (upper or lower) of letters used in | |||
| label (xn--99zt52a), which is not converted. | percent-encodings may not be preserved. The example also contains a | |||
| punycode-encoded domain name label (xn--99zt52a), which is not | ||||
| converted. | ||||
| 1) http://xn--99zt52a.example.org/%e2%80%ae | 1) http://xn--99zt52a.example.org/%e2%80%ae | |||
| 2) http://xn--99zt52a.example.org/<e2><80><ae> | 2) http://xn--99zt52a.example.org/<e2><80><ae> | |||
| 3) http://xn--99zt52a.example.org/<e2><80><ae> | 3) http://xn--99zt52a.example.org/<e2><80><ae> | |||
| 4) http://xn--99zt52a.example.org/%E2%80%AE | 4) http://xn--99zt52a.example.org/%E2%80%AE | |||
| 5) http://xn--99zt52a.example.org/%E2%80%AE | 5) http://xn--99zt52a.example.org/%E2%80%AE | |||
| Implementations with scheme-specific knowledge MAY convert punycode- | Implementations with scheme-specific knowledge MAY convert | |||
| encoded domain name labels to the corresponding characters using the | punycode-encoded domain name labels to the corresponding characters | |||
| ToUnicode procedure. Thus, for the example above, the label | using the ToUnicode procedure. Thus, for the example above, the label | |||
| xn--99zt52a may be converted to U+7D0D U+8C46 (Japanese Natto), leading | xn--99zt52a may be converted to U+7D0D U+8C46 (Japanese Natto), | |||
| to the overall IRI of | leading to the overall IRI of | |||
| http://納豆.example.org/%E2%80%AE | http://納豆.example.org/%E2%80%AE | |||
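The following is an illustrative sketch of Steps 1 to 5 of Section 3.2, not text from either draft. The names uri_to_iri, UNRESERVED and BIDI_FORMATTING are hypothetical. As a simplifying assumption, Step 4 re-encodes only the bidirectional formatting characters that Section 4.1 forbids; the further choices described in Section 6.1 are not modelled, and ToUnicode conversion of domain labels is not attempted.

```python
import re

# ASCII characters whose percent-encodings may be decoded in Step 2.
UNRESERVED = frozenset(
    b"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~"
)
# Assumption: LRM, RLM, LRE, RLE, PDF, LRO, RLO are the only characters
# treated as "not appropriate" in Step 4.
BIDI_FORMATTING = frozenset("\u200e\u200f\u202a\u202b\u202c\u202d\u202e")
PCT_RUN = re.compile(r"(?:%[0-9A-Fa-f]{2})+")


def _pct(octets: bytes) -> str:
    return "".join(f"%{octet:02X}" for octet in octets)


def _convert_run(match) -> str:
    raw = match.group(0)
    octets = bytes.fromhex(raw.replace("%", ""))  # Step 2: triplets to octets
    out = []
    i = 0
    while i < len(octets):
        octet = octets[i]
        if octet < 0x80:
            # '%', reserved and disallowed ASCII keep their original triplet.
            out.append(chr(octet) if octet in UNRESERVED else raw[3 * i:3 * i + 3])
            i += 1
            continue
        char, width = None, 1
        for n in (2, 3, 4):  # shortest strictly legal UTF-8 sequence at i
            try:
                char, width = octets[i:i + n].decode("utf-8"), n
                break
            except UnicodeDecodeError:
                pass
        if char is None:
            out.append(_pct(octets[i:i + 1]))      # Step 3: not legal UTF-8
        elif char in BIDI_FORMATTING:
            out.append(_pct(octets[i:i + width]))  # Step 4: not appropriate
        else:
            out.append(char)                       # Step 5: keep the character
        i += width
    return "".join(out)


def uri_to_iri(uri: str) -> str:
    uri.encode("ascii")  # Step 1: a URI consists of US-ASCII only
    return PCT_RUN.sub(_convert_run, uri)
```

Applied to the three example URIs in Section 3.2.1, this sketch yields the results shown in their respective step 5 lines.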
| 4. Bidirectional IRIs for Right-to-left Languages | 4. Bidirectional IRIs for Right-to-left Languages | |||
| Some UCS characters, such as those used in the Arabic and Hebrew | Some UCS characters, such as those used in the Arabic and Hebrew | |||
| script, have an inherent right-to-left (rtl) writing direction. IRIs | script, have an inherent right-to-left (rtl) writing direction. IRIs | |||
| containing such characters (called bidirectional IRIs or Bidi IRIs) | containing such characters (called bidirectional IRIs or Bidi IRIs) | |||
| require additional attention because of the non-trivial relation | require additional attention because of the non-trivial relation | |||
| between logical representation (used for digital representation as | between logical representation (used for digital representation as | |||
| well as when reading/spelling) and visual representation (used for | well as when reading/spelling) and visual representation (used for | |||
| display/printing). | display/printing). | |||
| Because of the complex interaction between the logical | Because of the complex interaction between the logical | |||
| representation, the visual representation, and the syntax of a Bidi | representation, the visual representation, and the syntax of a Bidi | |||
| IRI, a balance is needed between various requirements. The main | IRI, a balance is needed between various requirements. The main | |||
| requirements are: | requirements are: | |||
| 1) user-predictable conversion between visual and logical | 1) user-predictable conversion between visual and logical | |||
| representation; | representation; | |||
| 2) the ability to include a wide range of characters in various | ||||
| parts of the IRI; | 2) the ability to include a wide range of characters in various parts | |||
| of the IRI; | ||||
| 3) minor or no changes or restrictions for implementations. | 3) minor or no changes or restrictions for implementations. | |||
| 4.1 Logical Storage and Visual Presentation | 4.1 Logical Storage and Visual Presentation | |||
| When stored or transmitted in digital representation, bidirectional | When stored or transmitted in digital representation, bidirectional | |||
| IRIs MUST be in full logical order, and MUST conform to the IRI | IRIs MUST be in full logical order, and MUST conform to the IRI | |||
| syntax rules (which includes the rules relevant to their scheme). | syntax rules (which includes the rules relevant to their scheme). | |||
| This assures that bidirectional IRIs can be processed in the same way | This assures that bidirectional IRIs can be processed in the same way | |||
| as other IRIs. | as other IRIs. | |||
| When rendered, bidirectional IRIs MUST be rendered using the Unicode | When rendered, bidirectional IRIs MUST be rendered using the Unicode | |||
| Bidirectional Algorithm [UNIV4], [UNI9]. Bidirectional IRIs MUST be | Bidirectional Algorithm [UNIV4], [UNI9]. Bidirectional IRIs MUST be | |||
| rendered in the same way as they would be rendered if they were in a | rendered in the same way as they would be rendered if they were in a | |||
| left-to-right embedding, i.e. as if they were preceded by U+202A, | left-to-right embedding, i.e. as if they were preceded by U+202A, | |||
| LEFT-TO-RIGHT EMBEDDING (LRE), and followed by U+202C, POP | LEFT-TO-RIGHT EMBEDDING (LRE), and followed by U+202C, POP | |||
| DIRECTIONAL FORMATTING (PDF). Setting the embedding direction can | DIRECTIONAL FORMATTING (PDF). Setting the embedding direction can | |||
| also be done in a higher-order protocol (e.g. the dir='ltr' | also be done in a higher-order protocol (e.g. the dir='ltr' attribute | |||
| attribute in HTML). | in HTML). | |||
| There is no requirement to actually use the above embedding if the | There is no requirement to actually use the above embedding if the | |||
| display is still the same without the embedding. For example, a | display is still the same without the embedding. For example, a | |||
| bidirectional IRI in a text with left-to-right base directionality | bidirectional IRI in a text with left-to-right base directionality | |||
| (such as used for English or Cyrillic) that is preceded and followed | (such as used for English or Cyrillic) that is preceded and followed | |||
| by whitespace and strong left-to-right characters does not need an | by whitespace and strong left-to-right characters does not need an | |||
| embedding. Also, a bidirectional relative IRI that only contains | embedding. Also, a bidirectional relative IRI that only contains | |||
| strong right-to-left characters and weak characters and that starts | strong right-to-left characters and weak characters and that starts | |||
| and ends with a strong right-to-left character and appears in a text | and ends with a strong right-to-left character and appears in a text | |||
| with right-to-left base directionality (such as used for Arabic or | with right-to-left base directionality (such as used for Arabic or | |||
| skipping to change at page 17, line 9 | skipping to change at page 18, line 11 | |||
| The Unicode Bidirectional Algorithm ([UNI9], Section 4.3) permits | The Unicode Bidirectional Algorithm ([UNI9], Section 4.3) permits | |||
| higher-level protocols to influence bidirectional rendering. Such | higher-level protocols to influence bidirectional rendering. Such | |||
| changes by higher-level protocols MUST NOT be used if they change the | changes by higher-level protocols MUST NOT be used if they change the | |||
| rendering of IRIs. | rendering of IRIs. | |||
| The bidirectional formatting characters that may be used before or | The bidirectional formatting characters that may be used before or | |||
| after the IRI to assure correct display are themselves not part of | after the IRI to assure correct display are themselves not part of | |||
| the IRI. IRIs MUST NOT contain bidirectional formatting characters | the IRI. IRIs MUST NOT contain bidirectional formatting characters | |||
| (LRM, RLM, LRE, RLE, LRO, RLO, and PDF). They affect the visual | (LRM, RLM, LRE, RLE, LRO, RLO, and PDF). They affect the visual | |||
| rendering of the IRI, but do not themselves appear visually. It | rendering of the IRI, but do not themselves appear visually. It would | |||
| would therefore not be possible to correctly input an IRI with such | therefore not be possible to correctly input an IRI with such | |||
| characters. | characters. | |||
| 4.2 Bidi IRI Structure | 4.2 Bidi IRI Structure | |||
| The Unicode Bidirectional Algorithm is designed mainly for running | The Unicode Bidirectional Algorithm is designed mainly for running | |||
| text. To make sure that it does not affect the rendering of | text. To make sure that it does not affect the rendering of | |||
| bidirectional IRIs too much, some restrictions on bidirectional IRIs | bidirectional IRIs too much, some restrictions on bidirectional IRIs | |||
| are necessary. These restrictions are given in terms of delimiters | are necessary. These restrictions are given in terms of delimiters | |||
| (structural characters, mostly punctuation such as '@', '.', ':', | (structural characters, mostly punctuation such as '@', '.', ':', | |||
| '/') and components (usually consisting mostly of letters and | '/') and components (usually consisting mostly of letters and | |||
| digits). | digits). | |||
| The following syntax rules from Section 2.2 correspond to components | The following syntax rules from Section 2.2 correspond to components | |||
| for the purpose of Bidi behavior: iuserinfo, isegment, ireg-name, | for the purpose of Bidi behavior: iuserinfo, ireg-name, isegment, | |||
| iquery, and ifragment. | isegment-nz, isegment-nzc, ireg-name, iquery, and ifragment. | |||
| Specifications that define the syntax of any of the above components | Specifications that define the syntax of any of the above components | |||
| MAY divide them further and define smaller parts to be components | MAY divide them further and define smaller parts to be components | |||
| according to this document. As an example, the restrictions of | according to this document. As an example, the restrictions of | |||
| [RFC3490] on bidirectional domain names correspond to treating each | [RFC3490] on bidirectional domain names correspond to treating each | |||
| label of a domain name as a component for those schemes where ireg- | label of a domain name as a component for those schemes where | |||
| name is a domain name. Even where the components are not defined | ireg-name is a domain name. Even where the components are not defined | |||
| formally, it may be helpful to think about some syntax in terms of | formally, it may be helpful to think about some syntax in terms of | |||
| components and to apply the relevant restrictions. For example, for | components and to apply the relevant restrictions. For example, for | |||
| the usual name/value syntax in query parts, it is convenient to treat | the usual name/value syntax in query parts, it is convenient to treat | |||
| each name and each value as a component. As another example, the | each name and each value as a component. As another example, the | |||
| extensions in a resource name can be treated as separate components. | extensions in a resource name can be treated as separate components. | |||
| For each component, the following restrictions apply: | For each component, the following restrictions apply: | |||
| 1) A component SHOULD NOT not use both right-to-left and left-to- | 1) A component SHOULD NOT use both right-to-left and left-to-right | |||
| right characters. | characters. | |||
| 2) A component using right-to-left characters SHOULD start and end | 2) A component using right-to-left characters SHOULD start and end | |||
| with right-to-left characters. | with right-to-left characters. | |||
| The above restrictions are given as shoulds, rather than as musts. | The above restrictions are given as shoulds, rather than as musts. | |||
| For IRIs that are never presented visually, they are not relevant. | For IRIs that are never presented visually, they are not relevant. | |||
| However, for IRIs in general, they are very important to ensure | However, for IRIs in general, they are very important to ensure | |||
| consistent conversion between visual presentation and logical | consistent conversion between visual presentation and logical | |||
| representation, in both directions. | representation, in both directions. | |||
| Note: In some components, the above restrictions may actually be | Note: In some components, the above restrictions may actually be | |||
| strictly enforced. For example, [RFC3490] requires that these | strictly enforced. For example, [RFC3490] requires that these | |||
| restrictions apply to the labels of a host name for those | restrictions apply to the labels of a host name for those schemes | |||
| schemes where ireg-name is a host name. In some other | where ireg-name is a host name. In some other components, for | |||
| components, for example path components, following these | example path components, following these restrictions may not be | |||
| restrictions may not be too difficult. For other components, | too difficult. For other components, such as parts of the query | |||
| such as parts of the query part, it may be very difficult to | part, it may be very difficult to enforce the restrictions, | |||
| enforce the restrictions, because the values of query | because the values of query parameters may be arbitrary character | |||
| parameters may be arbitrary character sequences. | sequences. | |||
| If the above restrictions cannot be satisfied otherwise, the affected | If the above restrictions cannot be satisfied otherwise, the affected | |||
| component can always be mapped to URI notation as described in | component can always be mapped to URI notation as described in | |||
| Section 3.1. Please note that the whole component needs to be mapped | Section 3.1. Please note that the whole component needs to be mapped | |||
| (see also Example 9 below). | (see also Example 9 below). | |||
| 4.3 Input of Bidi IRIs | 4.3 Input of Bidi IRIs | |||
| Bidi input methods MUST generate Bidi IRIs in logical order while | Bidi input methods MUST generate Bidi IRIs in logical order while | |||
| rendering them according to Section 4.1. During input, rendering | rendering them according to Section 4.1. During input, rendering | |||
| skipping to change at page 19, line 11 | skipping to change at page 20, line 14 | |||
| inverted as a whole: | inverted as a whole: | |||
| logical representation: http://ab.CDE.FGH/ij/kl/mn/op.html | logical representation: http://ab.CDE.FGH/ij/kl/mn/op.html | |||
| visual representation: http://ab.HGF.EDC/ij/kl/mn/op.html | visual representation: http://ab.HGF.EDC/ij/kl/mn/op.html | |||
| A sequence of rtl components is read rtl, in the same way as a | A sequence of rtl components is read rtl, in the same way as a | |||
| sequence of rtl words is read rtl in a bidi text. | sequence of rtl words is read rtl in a bidi text. | |||
| Example 3: All components of an IRI (except for the scheme) are rtl. | Example 3: All components of an IRI (except for the scheme) are rtl. | |||
| All rtl components are inverted overall: | All rtl components are inverted overall: | |||
| logical representation: http://AB.CD.EF/GH/IJ/KL?MN=OP;QR=ST#UV | logical representation: http://AB.CD.EF/GH/IJ/KL?MN=OP;QR=ST#UV | |||
| visual representation: http://VU#TS=RQ;PO=NM?LK/JI/HG/FE.DC.BA | visual representation: http://VU#TS=RQ;PO=NM?LK/JI/HG/FE.DC.BA | |||
| The whole IRI (except the scheme) is read rtl. Delimiters between | The whole IRI (except the scheme) is read rtl. Delimiters between rtl | |||
| rtl components stay between the respective components; delimiters | components stay between the respective components; delimiters between | |||
| between ltr and rtl components don't move. | ltr and rtl components don't move. | |||
| Example 4: Several sequences of rtl components are each inverted on | Example 4: Several sequences of rtl components are each inverted on | |||
| their own: | their own: | |||
| logical representation: http://AB.CD.ef/gh/IJ/KL.html | logical representation: http://AB.CD.ef/gh/IJ/KL.html | |||
| visual representation: http://DC.BA.ef/gh/LK/JI.html | visual representation: http://DC.BA.ef/gh/LK/JI.html | |||
| Each sequence of rtl components is read rtl, in the same way as each | Each sequence of rtl components is read rtl, in the same way as each | |||
| sequence of rtl words in an ltr text is read rtl. | sequence of rtl words in an ltr text is read rtl. | |||
| Example 5: Example 2, applied to components of different kinds: | Example 5: Example 2, applied to components of different kinds: | |||
| logical representation: http://ab.cd.EF/GH/ij/kl.html | logical representation: http://ab.cd.EF/GH/ij/kl.html | |||
| visual representation: http://ab.cd.HG/FE/ij/kl.html | visual representation: http://ab.cd.HG/FE/ij/kl.html | |||
| The inversion of the domain name label and the path component may be | The inversion of the domain name label and the path component may be | |||
| unexpected, but is consistent with other bidi behavior. For | unexpected, but is consistent with other bidi behavior. For | |||
| reassurance that the domain component really is "ab.cd.EF", it may be | reassurance that the domain component really is "ab.cd.EF", it may be | |||
| helpful to read aloud the visual representation following the bidi | helpful to read aloud the visual representation following the bidi | |||
| algorithm. After "http://ab.cd." one reads the RTL block "E-F-slash- | algorithm. After "http://ab.cd." one reads the RTL block | |||
| G-H", which corresponds to the logical representation. | "E-F-slash-G-H", which corresponds to the logical representation. | |||
| Example 6: Same as example 5, with more rtl components: | Example 6: Same as example 5, with more rtl components: | |||
| logical representation: http://ab.CD.EF/GH/IJ/kl.html | logical representation: http://ab.CD.EF/GH/IJ/kl.html | |||
| visual representation: http://ab.JI/HG/FE.DC/kl.html | visual representation: http://ab.JI/HG/FE.DC/kl.html | |||
| The inversion of the domain name labels and the path components may | The inversion of the domain name labels and the path components may | |||
| be easier to identify because the delimiters also move. | be easier to identify because the delimiters also move. | |||
| Example 7: A single rtl component with included digits: | Example 7: A single rtl component with included digits: | |||
| logical representation: http://ab.CDE123FGH.ij/kl/mn/op.html | logical representation: http://ab.CDE123FGH.ij/kl/mn/op.html | |||
| visual representation: http://ab.HGF123EDC.ij/kl/mn/op.html | visual representation: http://ab.HGF123EDC.ij/kl/mn/op.html | |||
| skipping to change at page 20, line 7 | skipping to change at page 21, line 10 | |||
| Example 8 (not allowed): Numbers at the start or end of a rtl | Example 8 (not allowed): Numbers at the start or end of a rtl | |||
| component: | component: | |||
| logical representation: http://ab.cd.ef/GH1/2IJ/KL.html | logical representation: http://ab.cd.ef/GH1/2IJ/KL.html | |||
| visual representation: http://ab.cd.ef/LK/JI1/2HG.html | visual representation: http://ab.cd.ef/LK/JI1/2HG.html | |||
| The sequence '1/2' is interpreted by the bidi algorithm as a | The sequence '1/2' is interpreted by the bidi algorithm as a | |||
| fraction, fragmenting the components and leading to confusion. There | fraction, fragmenting the components and leading to confusion. There | |||
| are other characters that are interpreted in a special way close to | are other characters that are interpreted in a special way close to | |||
| numbers, in particular '+', '-', '#', '$', '%', ',', '.', and ':'. | numbers, in particular '+', '-', '#', '$', '%', ',', '.', and ':'. | |||
| Example 9 (not allowed): The numbers in the previous example are | Example 9 (not allowed): The numbers in the previous example are | |||
| escaped: | percent-encoded: | |||
| logical representation: http://ab.cd.ef/GH%31/%32IJ/KL.html, | logical representation: http://ab.cd.ef/GH%31/%32IJ/KL.html, | |||
| visual representation (Hebrew): http://ab.cd.ef/LK/JI%32/%31HG.html | visual representation (Hebrew): http://ab.cd.ef/LK/JI%32/%31HG.html | |||
| visual representation (Arabic): http://ab.cd.ef/LK/JI32%/31%HG.html | visual representation (Arabic): http://ab.cd.ef/LK/JI32%/31%HG.html | |||
| Depending on whether the upper-case letters represent Arabic or | Depending on whether the upper-case letters represent Arabic or | |||
| Hebrew, the visual representation is different. | Hebrew, the visual representation is different. | |||
| Example 10 (allowed, but not recommended): | Example 10 (allowed, but not recommended): | |||
| logical representation: http://ab.CDEFGH.123/kl/mn/op.html | logical representation: http://ab.CDEFGH.123/kl/mn/op.html | |||
| visual representation: http://ab.123.HGFEDC/kl/mn/op.html | visual representation: http://ab.123.HGFEDC/kl/mn/op.html | |||
| Components consisting of only numbers are allowed (it would be rather | Components consisting of only numbers are allowed (it would be rather | |||
| difficult to prohibit them), but may interact with adjacent RTL | difficult to prohibit them), but may interact with adjacent RTL | |||
| components in ways that are not easy to predict. | components in ways that are not easy to predict. | |||
| 5. IRI Equivalence and Comparison | 5. IRI Equivalence and Comparison | |||
| This section discusses IRI Equivalence and Comparison similar to | This section discusses IRI Equivalence and Comparison similar to | |||
| Section 6, "Normalization and Comparison", in [RFCYYYY]. This | Section 6, "Normalization and Comparison", in [RFCYYYY]. This section | |||
| section focuses on the main issues and on aspects that are different | focuses on the main issues and on aspects that are different from | |||
| from [RFCYYYY]; Section 6 of [RFCYYYY] is recommended background | [RFCYYYY]; Section 6 of [RFCYYYY] is recommended background reading. | |||
| reading. | ||||
| There is no general rule or procedure to decide whether two arbitrary | There is no general rule or procedure to decide whether two arbitrary | |||
| IRIs are equivalent or not (i.e. whether they refer to the same | IRIs are equivalent or not (i.e. whether they refer to the same | |||
| resource or not). Two IRIs that look almost the same may refer to | resource or not). Two IRIs that look almost the same may refer to | |||
| different resources. Two IRIs that look completely different may | different resources. Two IRIs that look completely different may | |||
| refer to the same resource. Each specification or application that | refer to the same resource. Each specification or application that | |||
| uses IRIs has to decide on the appropriate criterion for IRI | uses IRIs has to decide on the appropriate criterion for IRI | |||
| equivalence. | equivalence. | |||
| 5.1 Simple String Comparison | 5.1 Simple String Comparison | |||
| skipping to change at page 21, line 11 | skipping to change at page 22, line 12 | |||
| http://example.org/%7Euser are not equivalent under this definition. | http://example.org/%7Euser are not equivalent under this definition. | |||
| In such a case, the comparison function MUST NOT map IRIs to URIs, | In such a case, the comparison function MUST NOT map IRIs to URIs, | |||
| because such a mapping would create additional spurious equivalences. | because such a mapping would create additional spurious equivalences. | |||
| It follows that IRIs SHOULD NOT be modified when being transported if | It follows that IRIs SHOULD NOT be modified when being transported if | |||
| there is any chance that this IRI might be used as an identifier in | there is any chance that this IRI might be used as an identifier in | |||
| the way explained above. | the way explained above. | |||
| 5.2 Conversion to URIs | 5.2 Conversion to URIs | |||
| For actual resolution, differences in escaping (except for the | For actual resolution, differences in percent-encoding (except for | |||
| escaping of reserved characters) MUST always result in the same | the percent-encoding of reserved characters) MUST always result in | |||
| resource. For example, http://example.org/~user, | the same resource. For example, http://example.org/~user, | |||
| http://example.org/%7euser and http://example.org/%7Euser must | http://example.org/%7euser and http://example.org/%7Euser must | |||
| resolve to the same resource. | resolve to the same resource. | |||
| If this kind of equivalence is to be tested, the escaping of both | If this kind of equivalence is to be tested, the percent-encoding of | |||
| IRIs to be compared has to be aligned, for example by converting both | both IRIs to be compared has to be aligned, for example by converting | |||
| IRIs to URIs (see Section 3.1) and making sure that the case of the | both IRIs to URIs (see Section 3.1) and making sure that the case of | |||
| hexadecimal characters in the %-escape is always the same (preferably | the hexadecimal characters in the percent-encode is always the same | |||
| upper case). For comparison, such conversions MUST only be done on | (preferably upper case). For comparison, such conversions MUST only | |||
| the fly, while retaining the original IRI. | be done on the fly, while retaining the original IRI. | |||
| Additional, similar equivalences are possible based on knowledge | Additional, similar equivalences are possible based on knowledge | |||
| about the generic URI/IRI syntax, such as the fact that the scheme | about the generic URI/IRI syntax, such as the fact that the scheme | |||
| part is case-insensitive. | part is case-insensitive. | |||
| 5.3 Normalization | 5.3 Normalization | |||
| The Unicode Standard [UNIV4] defines various equivalences between | The Unicode Standard [UNIV4] defines various equivalences between | |||
| sequences of characters for various purposes. Unicode Standard Annex | sequences of characters for various purposes. Unicode Standard Annex | |||
| #15 [UTR15] defines various Normalization Forms for these | #15 [UTR15] defines various Normalization Forms for these | |||
| equivalences, in particular Normalization Form C (NFC, Canonical | equivalences, in particular Normalization Form C (NFC, Canonical | |||
| Decomposition, followed by Canonical Composition) and Normalization | Decomposition, followed by Canonical Composition) and Normalization | |||
| Form KC (NFKC, Compatibility Decomposition, followed by Canonical | Form KC (NFKC, Compatibility Decomposition, followed by Canonical | |||
| Composition). | Composition). | |||
| Equivalence of IRIs MUST rely on the assumption that IRIs are | Equivalence of IRIs MUST rely on the assumption that IRIs are | |||
| appropriately pre-normalized, rather than applying normalization when | appropriately pre-normalized, rather than applying normalization when | |||
| comparing two IRIs. The exceptions are conversion from a non-digital | comparing two IRIs. The exceptions are conversion from a non-digital | |||
| form, and conversion from a non-UCS-based encoding to a UCS-based | form, and conversion from a non-UCS-based character encoding to a | |||
| encoding. In these cases, NFC or a normalizing transcoder using NFC | UCS-based character encoding. In these cases, NFC or a normalizing | |||
| MUST be used for interoperability. To avoid false negatives and | transcoder using NFC MUST be used for interoperability. To avoid | |||
| problems with transcoding, IRIs SHOULD be created using NFC. Using | false negatives and problems with transcoding, IRIs SHOULD be created | |||
| NFKC may avoid even more problems, for example by choosing half-width | using NFC. Using NFKC may avoid even more problems, for example by | |||
| Latin letters instead of full-width, and full-width Katakana instead | choosing half-width Latin letters instead of full-width, and | |||
| of half-width. | full-width Katakana instead of half-width. | |||
| As an example, http://www.example.org/r&#xE9;sum&#xE9;.html (in XML | As an example, http://www.example.org/r&#xE9;sum&#xE9;.html (in XML | |||
| Notation) is in NFC. On the other hand, http://www.example.org/ | Notation) is in NFC. On the other hand, http://www.example.org/ | |||
| re&#x301;sume&#x301;.html is not in NFC. The former uses precombined | re&#x301;sume&#x301;.html is not in NFC. The former uses precombined | |||
| e-acute characters, the latter uses 'e' characters followed by | e-acute characters, the latter uses 'e' characters followed by | |||
| combining acute accents. Both usages are defined to be canonically | combining acute accents. Both usages are defined to be canonically | |||
| equivalent in [UNIV4]. | equivalent in [UNIV4]. | |||
| Because it is unknow how a particular field is being treated | Note: Because it is unknown how a particular field is being treated | |||
| with respect to text normalization, it would be inappropriate | with respect to text normalization, it would be inappropriate to | |||
| to allow third parties to normalize an IRI arbitrarily. This | allow third parties to normalize an IRI arbitrarily. This does not | |||
| does not contradict the recommendation that when a resource is | contradict the recommendation that when a resource is created, its | |||
| created, and an IRI for that resource, you try to be as | IRI should be as normalized as possible (i.e. NFC or even NFKC). | |||
| normalized as possible (i.e. NFC or even NFKC). This is | This is similar to the upper-case/lower-case problems in URIs. | |||
| similar to the upper-case/lower-case problems in URIs. Some | Some parts of a URI are case-insensitive (domain name). For | |||
| parts of a URI are case-insensitive (domain name). For others, | others, it is unclear whether they are case-sensitive or | |||
| it is unclear whether they are case-sensitive or case- | case-insensitive, or something in between (e.g. case-sensitive, | |||
| insensitive, or something in between (e.g. case-sensitive, but | but if the wrong case is used, a multiple choice selection is | |||
| if the wrong case is used, a multiple choice selection is | provided instead of a direct negative result). The best recipe is | |||
| provided instead of a direct negative result). The best recipe | that the creator uses a reasonable capitalization, and when | |||
| is that the generator uses a reasonable capitalization, and | transferring the URI, that capitalization is never changed. | |||
| when transfering the URI, that capitalization is never changed. | ||||
| Various IRI schemes may allow the usage of International Domain Names | Various IRI schemes may allow the usage of International Domain Names | |||
| (IDN) [RFC3490]. When in use in IRIs, those names SHOULD be | (IDN) [RFC3490]. When in use in IRIs, those names SHOULD be validated | |||
| validated using the ToASCII operation defined in [RFC3490], with the | using the ToASCII operation defined in [RFC3490], with the flags | |||
| flags "UseSTD3ASCIIRules" and "AllowUnassigned". An IRI containing | "UseSTD3ASCIIRules" and "AllowUnassigned". An IRI containing an | |||
| an invalid IDN cannot successfully be resolved. For legibility | invalid IDN cannot successfully be resolved. For legibility purposes, | |||
| purposes, IDN components of IRIs SHOULD NOT be converted into ASCII | IDN components of IRIs SHOULD NOT be converted into ASCII Compatible | |||
| Compatible Encoding (ACE). | Encoding (ACE). | |||
| 5.4 Preferred Forms | 5.4 Preferred Forms | |||
| The following are the preferred forms for IRIs when generated: | The following are the preferred forms for IRIs when created: | |||
| - Always provide the URI scheme in lowercase characters. | - Always provide the URI scheme in lowercase characters. | |||
| - Only perform percent-escaping where it is essential. | - Only perform percent-encoding where it is essential. | |||
| - Always use uppercase A-through-F characters when percent- | - Always use uppercase A-through-F characters when percent-encoding. | |||
| escaping. | ||||
| - Always provide the hostname, if any, in the form produced when | - For those schemes where ireg-name is a domain name, always provide | |||
| applying nameprep [RFC3491]. This in particular includes using | the individual labels, in the form produced when applying nameprep | |||
| lowercase characters rather than uppercase characters where | [RFC3491]. This in particular includes using lowercase characters | |||
| applicable. | rather than uppercase characters where applicable. Also, always | |||
| use US-ASCII '.' as a separator. | ||||
| - Where possible, provide IRI components in NFKC or NFC. | - Where possible, provide IRI components in NFKC or NFC. | |||
| - Prevent /./ and /../ from appearing in non-relative URI paths. | - Prevent /./ and /../ from appearing in non-relative URI paths. | |||
| - For schemes that define an empty path to be equivalent to a | - For schemes that define an empty path to be equivalent to a path | |||
| path of "/", use "/". | of "/", use "/". | |||
| 6. Use of IRIs | 6. Use of IRIs | |||
| 6.1 Limitations on UCS Characters Allowed in IRIs | 6.1 Limitations on UCS Characters Allowed in IRIs | |||
| This section discusses limitations on characters and character | This section discusses limitations on characters and character | |||
| sequences usable for IRIs. The considerations in this section are | sequences usable for IRIs. The considerations in this section are | |||
| relevant when creating IRIs and when converting from URIs to IRIs. | relevant when creating IRIs and when converting from URIs to IRIs. | |||
| a) The repertoire of characters allowed in each IRI component is | a) The repertoire of characters allowed in each IRI component is | |||
| limited by the definition of that component. For example, the | limited by the definition of that component. For example, the | |||
| definition of the scheme component does not allow characters | definition of the scheme component does not allow characters | |||
| beyond US-ASCII. | beyond US-ASCII. | |||
| (Note: In accordance with URI practice, generic IRI software | (Note: In accordance with URI practice, generic IRI software | |||
| cannot and should not check for such limitations.) | cannot and should not check for such limitations.) | |||
| b) The UCS contains many areas of characters for which there are | b) The UCS contains many areas of characters for which there are | |||
| strong visual look-alikes. Because of the likelihood of | strong visual look-alikes. Because of the likelihood of | |||
| transcription errors, these also should be avoided. This | transcription errors, these also should be avoided. This includes | |||
| includes the full-width equivalents of ASCII characters, half- | the full-width equivalents of ASCII characters, half-width | |||
| width Katakana characters for Japanese, and many others. This | Katakana characters for Japanese, and many others. This also | |||
| also includes many look-alikes of "space", "delims", and | includes many look-alikes of "space", "delims", and "unwise", | |||
| "unwise", characters excluded in [RFC3491]. | characters excluded in [RFC3491]. | |||
| Additional information is available from [UNIXML]. [UNIXML] is | Additional information is available from [UNIXML]. [UNIXML] is | |||
| written in the context of running text rather than in the context of | written in the context of running text rather than in the context of | |||
| identifiers. Nevertheless, it discusses many of the categories of | identifiers. Nevertheless, it discusses many of the categories of | |||
| characters not appropriate for IRIs. | characters not appropriate for IRIs. | |||
| 6.2 Software Interfaces and Protocols | 6.2 Software Interfaces and Protocols | |||
| Although an IRI is defined as a sequence of characters, software | Although an IRI is defined as a sequence of characters, software | |||
| interfaces for URIs typically function on sequences of octets or | interfaces for URIs typically function on sequences of octets or | |||
| skipping to change at page 23, line 52 | skipping to change at page 25, line 10 | |||
| URI-only components MUST map the IRIs per Section 3.1, when | URI-only components MUST map the IRIs per Section 3.1, when | |||
| transferring from IRI-capable to URI-only components. Such a mapping | transferring from IRI-capable to URI-only components. Such a mapping | |||
| SHOULD be applied as late as possible. It SHOULD NOT be applied | SHOULD be applied as late as possible. It SHOULD NOT be applied | |||
| between components that are known to be able to handle IRIs. | between components that are known to be able to handle IRIs. | |||
| 6.3 Format of URIs and IRIs in Documents and Protocols | 6.3 Format of URIs and IRIs in Documents and Protocols | |||
| Document formats that transport URIs may need to be upgraded to allow | Document formats that transport URIs may need to be upgraded to allow | |||
| the transport of IRIs. In those cases where the document as a whole | the transport of IRIs. In those cases where the document as a whole | |||
| has a native character encoding, IRIs MUST also be encoded in this | has a native character encoding, IRIs MUST also be encoded in this | |||
| encoding, and converted accordingly by a parser or interpreter. IRI | character encoding, and converted accordingly by a parser or | |||
| characters that are not expressible in the native encoding SHOULD be | interpreter. IRI characters that are not expressible in the native | |||
| escaped using the escaping conventions of the document format if such | character encoding SHOULD be escaped using the escaping conventions | |||
| conventions are available. Alternatively, they MAY be escaped | of the document format if such conventions are available. | |||
| according to Section 3.1. For example, in HTML or XML, numeric | Alternatively, they MAY be percent-encoded according to Section 3.1. | |||
| character references SHOULD be used. If a document as a whole has a | For example, in HTML or XML, numeric character references SHOULD be | |||
| native character encoding, and that character encoding is not UTF-8, | used. If a document as a whole has a native character encoding, and | |||
| then IRIs MUST NOT be placed into the document in the UTF-8 character | that character encoding is not UTF-8, then IRIs MUST NOT be placed | |||
| encoding. | into the document in the UTF-8 character encoding. | |||
| Note: Some formats already accommodate IRIs, although they use | Note: Some formats already accommodate IRIs, although they use | |||
| different terminology. HTML 4.0 [HTML4] defines the conversion from | different terminology. HTML 4.0 [HTML4] defines the conversion from | |||
| IRIs to URIs as error-avoiding behavior. XML 1.0 [XML1], XLink | IRIs to URIs as error-avoiding behavior. XML 1.0 [XML1], XLink | |||
| [XLink], and XML Schema [XMLSchema] and specifications based upon | [XLink], and XML Schema [XMLSchema] and specifications based upon | |||
| them allow IRIs. Also, it is expected that all relevant new W3C | them allow IRIs. Also, it is expected that all relevant new W3C | |||
| formats and protocols will be required to handle IRIs [CharMod]. | formats and protocols will be required to handle IRIs [CharMod]. | |||
| 6.4 Use of UTF-8 for Encoding Original Characters | 6.4 Use of UTF-8 for Encoding Original Characters | |||
| skipping to change at page 24, line 41 | skipping to change at page 25, line 48 | |||
| Examples where this is already used are the URN syntax [RFC2141], | Examples where this is already used are the URN syntax [RFC2141], | |||
| IMAP URLs [RFC2192], and POP URLs [RFC2384]. On the other hand, | IMAP URLs [RFC2192], and POP URLs [RFC2384]. On the other hand, | |||
| because the HTTP URL scheme does not specify how to encode original | because the HTTP URL scheme does not specify how to encode original | |||
| characters, only some HTTP URLs can have corresponding but different | characters, only some HTTP URLs can have corresponding but different | |||
| IRIs. | IRIs. | |||
| For example, for a document with a URI of | For example, for a document with a URI of | |||
| http://www.example.org/r%C3%A9sum%C3%A9.html, it is possible to | http://www.example.org/r%C3%A9sum%C3%A9.html, it is possible to | |||
| construct a corresponding IRI (in XML notation, see Section 1.4): | construct a corresponding IRI (in XML notation, see Section 1.4): | |||
| http://www.example.org/r&#xE9;sum&#xE9;.html (&#xE9; stands for the | http://www.example.org/r&#xE9;sum&#xE9;.html (&#xE9; stands for the | |||
| e-acute character, and %C3%A9 is the UTF-8 encoded and escaped | e-acute character, and %C3%A9 is the UTF-8 encoded and | |||
| representation of that character). On the other hand, for a document | percent-encoded representation of that character). On the other hand, | |||
| with a URI of http://www.example.org/r%E9sum%E9.html, the escaped | for a document with a URI of http://www.example.org/r%E9sum%E9.html, | |||
| octets cannot be converted to actual characters in an IRI, because | the percent-encoded octets cannot be converted to actual characters | |||
| the escaping is not based on UTF-8. | in an IRI, because the percent-encoding is not based on UTF-8. | |||
| The requirement for the use of UTF-8 applies to all parts of a URI. | The requirement for the use of UTF-8 applies to all parts of a URI | |||
| However, it is possible that the capability of IRIs to represent a | (with the potential exception of the ireg-name part, see Section | |||
| wide range of characters directly is used just in some parts of the | 3.1). However, it is possible that the capability of IRIs to | |||
| IRI (or IRI reference). The other parts of the IRI may only contain | represent a wide range of characters directly is used just in some | |||
| ASCII characters, or they may not be based on UTF-8. They may be | parts of the IRI (or IRI reference). The other parts of the IRI may | |||
| based on another encoding, or they may directly encode raw binary | only contain ASCII characters, or they may not be based on UTF-8. | |||
| data (see also [RFC2397]). | They may be based on another character encoding, or they may directly | |||
| encode raw binary data (see also [RFC2397]). | ||||
| For example, it is possible to have a URI reference of | For example, it is possible to have a URI reference of | |||
| http://www.example.org/r%E9sum%E9.xml#r%C3%A9sum%C3%A9, where the | http://www.example.org/r%E9sum%E9.xml#r%C3%A9sum%C3%A9, where the | |||
| document name is encoded in iso-8859-1 based on server settings, but | document name is encoded in iso-8859-1 based on server settings, but | |||
| the fragment identifier is encoded in UTF-8 according to [XPointer]. | the fragment identifier is encoded in UTF-8 according to [XPointer]. | |||
| The IRI corresponding to the above URI would be (in XML notation) | The IRI corresponding to the above URI would be (in XML notation) | |||
| http://www.example.org/r%E9sum%E9.xml#r&#xE9;sum&#xE9;. | http://www.example.org/r%E9sum%E9.xml#r&#xE9;sum&#xE9;. | |||
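The two examples above can be illustrated with a minimal Python sketch. It is not the full conversion procedure of Section 3.2; it only assumes that a run of percent-encoded octets may be turned into characters when the run is valid UTF-8 and contains no ASCII. The function name uri_to_iri_sketch is hypothetical and chosen for illustration.

```python
import re

# A run of one or more consecutive percent-encoded octets, e.g. "%C3%A9".
_PCT_RUN = re.compile(r"(?:%[0-9A-Fa-f]{2})+")


def uri_to_iri_sketch(uri: str) -> str:
    """Replace percent-encoded runs with characters where that is safe.

    Simplifying assumptions (not taken from the specification text):
    - a run is converted only if all of its octets form valid UTF-8;
    - a run that decodes to any ASCII character is left untouched, so
      percent-encoded reserved characters such as %2F stay encoded.
    """
    def convert(match: re.Match) -> str:
        run = match.group(0)
        octets = bytes.fromhex(run.replace("%", ""))
        try:
            text = octets.decode("utf-8")  # strict: rejects non-UTF-8 octets
        except UnicodeDecodeError:
            return run  # e.g. %E9 from iso-8859-1 stays percent-encoded
        if any(ord(ch) < 0x80 for ch in text):
            return run
        return text

    return _PCT_RUN.sub(convert, uri)
```

With this sketch, the UTF-8 based path r%C3%A9sum%C3%A9.html is converted to actual characters, while r%E9sum%E9 is left as is, and a UTF-8 based fragment after an iso-8859-1 based path is converted independently of that path.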
| Similar considerations apply to query parts. The functionality of | Similar considerations apply to query parts. The functionality of | |||
| IRIs (namely to be able to include non-ASCII characters) can only be | IRIs (namely to be able to include non-ASCII characters) can only be | |||
| skipping to change at page 25, line 30 | skipping to change at page 26, line 37 | |||
| Processing of relative forms of IRIs against a base is handled | Processing of relative forms of IRIs against a base is handled | |||
| straightforwardly; the algorithms of [RFCYYYY] can be applied | straightforwardly; the algorithms of [RFCYYYY] can be applied | |||
| directly, treating the characters additionally allowed in IRIs in the | directly, treating the characters additionally allowed in IRIs in the | |||
| same way as unreserved characters in URIs. | same way as unreserved characters in URIs. | |||
| 7. URI/IRI Processing Guidelines (informative) | 7. URI/IRI Processing Guidelines (informative) | |||
| This informative section provides guidelines for supporting IRIs in | This informative section provides guidelines for supporting IRIs in | |||
| the same software components and operations that currently process | the same software components and operations that currently process | |||
| URIs: software interfaces that handle URIs, software that allows | URIs: software interfaces that handle URIs, software that allows | |||
| users to enter URIs, software that generates URIs, software that | users to enter URIs, software that creates or generates URIs, | |||
| displays URIs, formats and protocols that transport URIs, and | software that displays URIs, formats and protocols that transport | |||
| software that interprets URIs. These may all require more or less | URIs, and software that interprets URIs. These may all require more | |||
| modification before functioning properly with IRIs. The | or less modification before functioning properly with IRIs. The | |||
| considerations in this section also apply to URI references and IRI | considerations in this section also apply to URI references and IRI | |||
| references. | references. | |||
| 7.1 URI/IRI Software Interfaces | 7.1 URI/IRI Software Interfaces | |||
| Software interfaces that handle URIs, such as URI-handling APIs and | Software interfaces that handle URIs, such as URI-handling APIs and | |||
| protocols transferring URIs, need interfaces and protocol elements | protocols transferring URIs, need interfaces and protocol elements | |||
| that are designed to carry IRIs. | that are designed to carry IRIs. | |||
| In case the current handling in an API or protocol is based on US- | In case the current handling in an API or protocol is based on | |||
| ASCII, UTF-8 is recommended as the encoding for IRIs, because this is | US-ASCII, UTF-8 is recommended as the character encoding for IRIs, | |||
| compatible with US-ASCII, is in accordance with the recommendations | because this is compatible with US-ASCII, is in accordance with the | |||
| of [RFC2277], and makes it easy to convert to URIs where necessary. | recommendations of [RFC2277], and makes it easy to convert to URIs | |||
| In any case, the API or protocol definition must clearly define the | where necessary. In any case, the API or protocol definition must | |||
| encoding to be used. | clearly define the character encoding to be used. | |||
| The transfer from URI-only to IRI-capable components requires no | The transfer from URI-only to IRI-capable components requires no | |||
| mapping, although the conversion described in Section 3.2 above may | mapping, although the conversion described in Section 3.2 above may | |||
| be performed. It is preferable not to perform this inverse | be performed. It is preferable not to perform this inverse conversion | |||
| conversion when there is a chance that this cannot be done correctly. | when there is a chance that this cannot be done correctly. | |||
| 7.2 URI/IRI Entry | 7.2 URI/IRI Entry | |||
| There are components that allow users to enter URIs into the system, | There are components that allow users to enter URIs into the system, | |||
| for example by typing or dictation. This software must be updated to | for example by typing or dictation. This software must be updated to | |||
| allow for IRI entry. | allow for IRI entry. | |||
| A person viewing a visual representation of an IRI (as a sequence of | A person viewing a visual representation of an IRI (as a sequence of | |||
| glyphs, in some order, in some visual display) or hearing an IRI, | glyphs, in some order, in some visual display) or hearing an IRI, | |||
| will use an entry method for characters in the user's language to | will use an entry method for characters in the user's language to | |||
| skipping to change at page 27, line 6 | skipping to change at page 28, line 14 | |||
| 7.3 URI/IRI Transfer Between Applications | 7.3 URI/IRI Transfer Between Applications | |||
| Many applications, in particular many mail user agents, try to detect | Many applications, in particular many mail user agents, try to detect | |||
| URIs appearing in plain text. For this, they use some heuristics | URIs appearing in plain text. For this, they use some heuristics | |||
| based on URI syntax. They then allow the user to click on such URIs | based on URI syntax. They then allow the user to click on such URIs | |||
| and retrieve the corresponding resource in an appropriate (usually | and retrieve the corresponding resource in an appropriate (usually | |||
| scheme-dependent) application. | scheme-dependent) application. | |||
| Such applications have to be upgraded to use the IRI syntax rather | Such applications have to be upgraded to use the IRI syntax rather | |||
| than the URI syntax as a base for heuristics. In particular, a non- | than the URI syntax as a base for heuristics. In particular, a | |||
| ASCII character should not be taken as the indication of the end of | non-ASCII character should not be taken as the indication of the end | |||
| an IRI. Such applications also have to make sure that they correctly | of an IRI. Such applications also have to make sure that they | |||
| convert the detected IRI from the encoding of the document or | correctly convert the detected IRI from the character encoding of the | |||
| application where the IRI appears to the encoding used by the system- | document or application where the IRI appears to the character | |||
| wide IRI invocation mechanism, or to a URI (according to Section 3.1) | encoding used by the system-wide IRI invocation mechanism, or to a | |||
| if the system-wide invocation mechanism only accepts URIs. | URI (according to Section 3.1) if the system-wide invocation | |||
| mechanism only accepts URIs. | ||||
| The clipboard is another frequently used way to transfer URIs and | The clipboard is another frequently used way to transfer URIs and | |||
| IRIs from one application to another. On most platforms, the | IRIs from one application to another. On most platforms, the | |||
| clipboard is able to store and transfer text in many languages and | clipboard is able to store and transfer text in many languages and | |||
| scripts. Correctly used, the clipboard transfers characters, not | scripts. Correctly used, the clipboard transfers characters, not | |||
| bytes, which will do the right thing with IRIs. | bytes, which will do the right thing with IRIs. | |||
| 7.4 URI/IRI Generation | 7.4 URI/IRI Generation | |||
| Systems that offer resources through the Internet, where those | Systems that offer resources through the Internet, where those | |||
| skipping to change at page 27, line 37 | skipping to change at page 28, line 46 | |||
| Many legacy character encodings are in use in various file systems. | Many legacy character encodings are in use in various file systems. | |||
| Many currently deployed systems do not transform the local character | Many currently deployed systems do not transform the local character | |||
| representation of the underlying system before generating URIs. | representation of the underlying system before generating URIs. | |||
| For maximum interoperability, systems that generate resource | For maximum interoperability, systems that generate resource | |||
| identifiers should do the appropriate transformations. For example, | identifiers should do the appropriate transformations. For example, | |||
| if a file system contains a file named résumé.html, a | if a file system contains a file named résumé.html, a | |||
| server should expose this as r%C3%A9sum%C3%A9.html in a URI, which | server should expose this as r%C3%A9sum%C3%A9.html in a URI, which | |||
| allows use of résumé.html in an IRI, even if the file name | allows use of résumé.html in an IRI, even if the file name | |||
| locally is kept in an encoding other than UTF-8. | locally is kept in a character encoding other than UTF-8. | |||
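The transformation recommended above can be sketched in Python as follows. The sketch assumes the server knows the character encoding of its file names; the function name path_segment_from_filename and its parameters are hypothetical. It uses urllib.parse.quote from the Python standard library, which percent-encodes the UTF-8 octets of a string.

```python
from urllib.parse import quote


def path_segment_from_filename(raw_name: bytes, fs_encoding: str) -> str:
    """Turn a locally encoded file name into a UTF-8 based URI path segment.

    raw_name    -- the file name octets as stored by the file system
    fs_encoding -- the (assumed known) character encoding of those octets
    """
    # Step 1: interpret the local octets as characters.
    name = raw_name.decode(fs_encoding)
    # Step 2: encode the characters as UTF-8 and percent-encode every octet
    # outside the unreserved set. safe="" also encodes "/" so that the result
    # is always a single path segment.
    return quote(name, safe="", encoding="utf-8")
```

For a file name stored in iso-8859-1 as the octets r, 0xE9, s, u, m, 0xE9, ".html", this yields r%C3%A9sum%C3%A9.html rather than r%E9sum%E9.html.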
| This recommendation in particular applies to HTTP servers. For FTP | This recommendation in particular applies to HTTP servers. For FTP | |||
| servers, similar considerations apply, see in particular [RFC2640]. | servers, similar considerations apply, see in particular [RFC2640]. | |||
| 7.5 URI/IRI Selection | 7.5 URI/IRI Selection | |||
| In some cases, resource owners and publishers have control over the | In some cases, resource owners and publishers have control over the | |||
| IRIs used to identify their resources. Such control is mostly | IRIs used to identify their resources. Such control is mostly | |||
| executed by controlling the resource names, such as file names, | executed by controlling the resource names, such as file names, | |||
| directly. | directly. | |||
| In such cases, it is recommended to avoid choosing IRIs that are | In such cases, it is recommended to avoid choosing IRIs that are | |||
| easily confused. For example, for US-ASCII, the lower-case ell "l" | easily confused. For example, for US-ASCII, the lower-case ell "l" is | |||
| is easily confused with the digit one "1", and the upper-case oh "O" | easily confused with the digit one "1", and the upper-case oh "O" is | |||
| is easily confused with the digit zero "0". Publishers should avoid | easily confused with the digit zero "0". Publishers should avoid | |||
| confusing users with "br0ken" or "1ame" identifiers. | confusing users with "br0ken" or "1ame" identifiers. | |||
| Outside of the US-ASCII range, there are many more opportunities for | Outside of the US-ASCII range, there are many more opportunities for | |||
| confusion; a complete set of guidelines is too lengthy to include | confusion; a complete set of guidelines is too lengthy to include | |||
| here. As long as names are limited to characters from a single | here. As long as names are limited to characters from a single | |||
| script, native writers of a given script or language will know best | script, native writers of a given script or language will know best | |||
| when ambiguities can appear, and how they can be avoided. What may | when ambiguities can appear, and how they can be avoided. What may | |||
| look ambiguous to a stranger may be completely obvious to the average | look ambiguous to a stranger may be completely obvious to the average | |||
| native user. On the other hand, in some cases, the UCS contains | native user. On the other hand, in some cases, the UCS contains | |||
| variants for compatibility reasons, for example for typographic | variants for compatibility reasons, for example for typographic | |||
| skipping to change at page 28, line 27 | skipping to change at page 29, line 39 | |||
| As an example, the UCS contains the 'fi' ligature at U+FB01 for | As an example, the UCS contains the 'fi' ligature at U+FB01 for | |||
| compatibility reasons. Wherever possible, IRIs should use the two | compatibility reasons. Wherever possible, IRIs should use the two | |||
| letters 'f' and 'i' rather than the 'fi' ligature. An example where | letters 'f' and 'i' rather than the 'fi' ligature. An example where | |||
| the latter may be used is in the query part of an IRI for an explicit | the latter may be used is in the query part of an IRI for an explicit | |||
| search for a word written containing the 'fi' ligature. | search for a word written containing the 'fi' ligature. | |||
| In certain cases, there is a chance that characters from different | In certain cases, there is a chance that characters from different | |||
| scripts look the same. The best known example is the Latin 'A', the | scripts look the same. The best known example is the Latin 'A', the | |||
| Greek 'Alpha', and the Cyrillic 'A'. To avoid such cases, only IRIs | Greek 'Alpha', and the Cyrillic 'A'. To avoid such cases, only IRIs | |||
| should be generated where all the characters in a single component | should be created where all the characters in a single component are | |||
| are used together in a given language. This usually means that all | used together in a given language. This usually means that all these | |||
| these characters will be from the same script, but there are | characters will be from the same script, but there are languages that | |||
| languages that mix characters from different scripts (such as | mix characters from different scripts (such as Japanese). This is | |||
| Japanese). This is similar to the heuristics used to distinguish | similar to the heuristics used to distinguish between letters and | |||
| between letters and numbers in the examples above. Also, for Latin, | numbers in the examples above. Also, for Latin, Greek, and Cyrillic, | |||
| Greek, and Cyrillic, using lower-case letters results in fewer | using lower-case letters results in fewer ambiguities than using | |||
| ambiguities than using upper-case letters. | upper-case letters. | |||
| 7.6 Display of URIs/IRIs | 7.6 Display of URIs/IRIs | |||
| In situations where the rendering software is not expected to display | In situations where the rendering software is not expected to display | |||
| non-ASCII parts of the IRI correctly using the available layout and | non-ASCII parts of the IRI correctly using the available layout and | |||
| font resources, these parts should be escaped before being displayed. | font resources, these parts should be percent-encoded before being | |||
| displayed. | ||||
| For display of Bidi IRIs, please see Section 4.1. | For display of Bidi IRIs, please see Section 4.1. | |||
| 7.7 Interpretation of URIs and IRIs | 7.7 Interpretation of URIs and IRIs | |||
| Software that interprets IRIs as the names of local resources should | Software that interprets IRIs as the names of local resources should | |||
| accept IRIs in multiple forms, and convert and match them with the | accept IRIs in multiple forms, and convert and match them with the | |||
| appropriate local resource names. | appropriate local resource names. | |||
| First, multiple representations include both IRIs in the native | First, multiple representations include both IRIs in the native | |||
| character encoding of the protocol and also their URI counterparts. | character encoding of the protocol and also their URI counterparts. | |||
| Second, it may include URIs constructed based on other character | Second, it may include URIs constructed based on other character | |||
| encodings than UTF-8. Such URIs may be produced by user agents that | encodings than UTF-8. Such URIs may be produced by user agents that | |||
| do not conform to this specification and use legacy encodings to | do not conform to this specification and use legacy character | |||
| convert non-ASCII characters to URIs. Whether this is necessary and | encodings to convert non-ASCII characters to URIs. Whether this is | |||
| what character encodings to cover, depends on a number of factors, | necessary and what character encodings to cover, depends on a number | |||
| such as the legacy character encodings used locally and the | of factors, such as the legacy character encodings used locally and | |||
| distribution of various versions of user agents. For example, | the distribution of various versions of user agents. For example, | |||
| software for Japanese may accept URIs in Shift_JIS and/or EUC-JP in | software for Japanese may accept URIs in Shift_JIS and/or EUC-JP in | |||
| addition to UTF-8. | addition to UTF-8. | |||
| Third, it may include additional mappings to be more user-friendly | Third, it may include additional mappings to be more user-friendly | |||
| and robust against transmission errors. These would be similar to | and robust against transmission errors. These would be similar to how | |||
| how currently some servers treat URIs as case-insensitive, or perform | currently some servers treat URIs as case-insensitive, or perform | |||
| additional matching to account for spelling errors. For characters | additional matching to account for spelling errors. For characters | |||
| beyond the ASCII repertoire, this may for example include ignoring | beyond the ASCII repertoire, this may for example include ignoring | |||
| the accents on received IRIs or resource names where appropriate. | the accents on received IRIs or resource names where appropriate. | |||
| Please note that such mappings, including case mappings, are | Please note that such mappings, including case mappings, are | |||
| language-dependent. | language-dependent. | |||
| It can be difficult to unambiguously identify a resource if too many | It can be difficult to unambiguously identify a resource if too many | |||
| mappings are taken into consideration. However, escaped and non- | mappings are taken into consideration. However, percent-encoded and | |||
| escaped parts of IRIs can always clearly be distinguished. Also, the | not percent-encoded parts of IRIs can always clearly be | |||
| regularity of UTF-8 (see [Duerst97]) makes the potential for | distinguished. Also, the regularity of UTF-8 (see [Duerst97]) makes | |||
| collisions lower than it may seem at first sight. | the potential for collisions lower than it may seem at first sight. | |||
| 7.8 Upgrading Strategy | 7.8 Upgrading Strategy | |||
| Where this recommendation places further constraints on software for | Where this recommendation places further constraints on software for | |||
| which many instances are already deployed, it is important to | which many instances are already deployed, it is important to | |||
| introduce upgrades carefully, and to be aware of the various | introduce upgrades carefully, and to be aware of the various | |||
| interdependencies. | interdependencies. | |||
| If IRIs cannot be interpreted correctly, they should not be generated | If IRIs cannot be interpreted correctly, they should not be created, | |||
| or transported. This suggests that upgrading URI interpreting | generated, or transported. This suggests that upgrading URI | |||
| software to accept IRIs should have highest priority. | interpreting software to accept IRIs should have highest priority. | |||
| On the other hand, a single IRI is interpreted only by a single or | On the other hand, a single IRI is interpreted only by a single or | |||
| very few interpreters that are known in advance, while it may be | very few interpreters that are known in advance, while it may be | |||
| entered and transported very widely. | entered and transported very widely. | |||
| Therefore, IRIs benefit most from a broad upgrade of software to be | Therefore, IRIs benefit most from a broad upgrade of software to be | |||
| able to enter and transport IRIs, but before publishing any | able to enter and transport IRIs, but before publishing any | |||
| individual IRI, care should be taken to upgrade the corresponding | individual IRI, care should be taken to upgrade the corresponding | |||
| interpreting software in order to cover the forms expected to be | interpreting software in order to cover the forms expected to be | |||
| received by various versions of entry and transport software. | received by various versions of entry and transport software. | |||
| The upgrade of generating software to generate IRIs instead of a | The upgrade of generating software to generate IRIs instead of using | |||
| local encoding should happen only after the service is upgraded to | a local character encoding should happen only after the service is | |||
| accept IRIs. Similarly, IRIs should only be generated when the | upgraded to accept IRIs. Similarly, IRIs should only be generated | |||
| service accepts IRIs and the intervening infrastructure and protocol | when the service accepts IRIs and the intervening infrastructure and | |||
| is known to transport them safely. | protocol is known to transport them safely. | |||
| Display software should be upgraded only after upgraded entry | Display software should be upgraded only after upgraded entry | |||
| software has been widely deployed to the population that will see the | software has been widely deployed to the population that will see the | |||
| displayed result. | displayed result. | |||
| It is often possible to reduce the effort and dependencies for | ||||
| upgrading to IRIs by using UTF-8 rather than another character | ||||
| encoding where there is a free choice of character encodings. For | ||||
| example, when setting up a new file-based Web server, using UTF-8 as | ||||
| the character encoding for file names will make the transition to | ||||
| IRIs easier. Likewise, when setting up a new Web form using UTF-8 as | ||||
| the character encoding of the form page, the returned query URIs will | ||||
| use UTF-8 as the character encoding (unless the user, for whatever | ||||
| reason, changes the character encoding) and will therefore be | ||||
| compatible with IRIs. | ||||
| These recommendations, when taken together, will allow for the | These recommendations, when taken together, will allow for the | |||
| extension from URIs to IRIs in order to handle scripts other than | extension from URIs to IRIs in order to handle scripts other than | |||
| ASCII while minimizing interoperability problems. | ASCII while minimizing interoperability problems. | |||
| 8. Security Considerations | 8. Security Considerations | |||
| Incorrect escaping or unescaping can lead to security problems. In | The security considerations discussed in [RFCYYYY] also apply to | |||
| IRIs. In addition, the following issues require particular care for | ||||
| IRIs. | ||||
| Incorrect encoding or decoding can lead to security problems. In | ||||
| particular, some UTF-8 decoders do not check against overlong byte | particular, some UTF-8 decoders do not check against overlong byte | |||
| sequences. As an example, a '/' is encoded with the byte 0x2F both | sequences. As an example, a '/' is encoded with the byte 0x2F both in | |||
| in UTF-8 and in ASCII, but some UTF-8 decoders also wrongly interpret | UTF-8 and in ASCII, but some UTF-8 decoders also wrongly interpret | |||
| the sequence 0xC0 0xAF as a '/'. A sequence such as '%C0%AF..' may | the sequence 0xC0 0xAF as a '/'. A sequence such as '%C0%AF..' may | |||
| pass some security tests and then be interpreted as '/..' in a path | pass some security tests and then be interpreted as '/..' in a path | |||
| if UTF-8 decoders are fault-tolerant, if conversion and checking are | if UTF-8 decoders are fault-tolerant, if conversion and checking are | |||
| not done in the right order, and/or if reserved characters and | not done in the right order, and/or if reserved characters and | |||
| unreserved characters are not clearly distinguished. | unreserved characters are not clearly distinguished. | |||
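The ordering problem described above can be illustrated with a short Python sketch. It assumes a hypothetical helper, is_safe_path, that a server might call before mapping a percent-encoded path to a local resource; it is an illustration of strict decoding followed by checking, not a complete security filter.

```python
from urllib.parse import unquote_to_bytes


def is_safe_path(pct_path: str) -> bool:
    """Decode first, strictly, and only then check for '..' segments.

    Python's UTF-8 decoder rejects overlong sequences such as 0xC0 0xAF,
    so '%C0%AF..' is refused instead of being read as '/..'.
    """
    octets = unquote_to_bytes(pct_path)
    try:
        path = octets.decode("utf-8")  # strict by default
    except UnicodeDecodeError:
        return False  # malformed or overlong UTF-8: reject outright
    # Conservative choice for this sketch: a percent-encoded '/' (%2F) is
    # treated like a literal separator, so '%2F..' is also caught.
    return ".." not in path.split("/")
```

The check is applied to the decoded characters, so a test that passes cannot be invalidated by a later, more tolerant decoding step.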
| There are various ways in which "spoofing" can occur with IRIs. | There are various ways in which "spoofing" can occur with IRIs. | |||
| "Spoofing" means that somebody may add a resource name that looks the | "Spoofing" means that somebody may add a resource name that looks the | |||
| same or similar to the user, but points to a different resource. The | same or similar to the user, but points to a different resource. The | |||
| added resource may pretend to be the real resource by looking very | added resource may pretend to be the real resource by looking very | |||
| similar, but may contain all kinds of changes that may be difficult | similar, but may contain all kinds of changes that may be difficult | |||
| to spot and can cause all kinds of problems. Most spoofing | to spot and can cause all kinds of problems. Most spoofing | |||
| possibilities for IRIs are extensions of those for URIs. | possibilities for IRIs are extensions of those for URIs. | |||
| Spoofing can occur for various reasons. A first reason is that | Spoofing can occur for various reasons. A first reason is that | |||
| normalization expectations of a user or actual normalization when | normalization expectations of a user or actual normalization when | |||
| entering an IRI, or when transcoding an IRI from a legacy encoding, | entering an IRI, or when transcoding an IRI from a legacy character | |||
| do not match the normalization used on the server side. | encoding, do not match the normalization used on the server side. | |||
| Conceptually, this is no different from the problems surrounding the | Conceptually, this is no different from the problems surrounding the | |||
| use of case-insensitive web servers. For example, a popular web page | use of case-insensitive web servers. For example, a popular web page | |||
| with a mixed case name (http://big.site/PopularPage.html) might be | with a mixed case name (http://big.site/PopularPage.html) might be | |||
| "spoofed" by someone who is able to create http://big.site/ | "spoofed" by someone who is able to create http://big.site/ | |||
| popularpage.html. However, the introduction of character | popularpage.html. However, the use of unnormalized character | |||
| normalization, and of additional mappings for user convenience, may | sequences, and of additional mappings for user convenience, may | |||
| increase the chance for spoofing. Protocols and servers that allow | increase the chance for spoofing. Protocols and servers that allow | |||
| the creation of resources with unnormalized names, and resources with | the creation of resources with unnormalized names, and resources with | |||
| names that are not normalized, are particularly vulnerable to such | names that are not normalized, are particularly vulnerable to such | |||
| attacks. This is an inherent security problem of the relevant | attacks. This is an inherent security problem of the relevant | |||
| protocol, server, or resource, and not specific to IRIs, but | protocol, server, or resource, and not specific to IRIs, but | |||
| mentioned here for completeness. | mentioned here for completeness. | |||
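A minimal Python sketch of the normalization mismatch described above, using unicodedata.normalize from the standard library. The helper name same_after_nfc is hypothetical, and the choice of Normalization Form C for the comparison is an assumption made for this illustration.

```python
import unicodedata


def same_after_nfc(name_a: str, name_b: str) -> bool:
    """Report whether two resource names coincide once both are in NFC."""
    return (unicodedata.normalize("NFC", name_a)
            == unicodedata.normalize("NFC", name_b))


# Two names that render identically but differ as character sequences:
precomposed = "r\u00e9sum\u00e9"    # e-acute as a single character
decomposed = "re\u0301sume\u0301"   # 'e' followed by a combining acute accent

distinct_as_stored = precomposed != decomposed
collide_when_normalized = same_after_nfc(precomposed, decomposed)
```

A server that stores both names as distinct resources, while clients normalize what the user enters, offers exactly the kind of opening for spoofing that the paragraph above describes.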
| Spoofing can occur in various IRI components, such as the domain name | Spoofing can occur in various IRI components, such as the domain name | |||
| part or a path part. For considerations specific to the domain name | part or a path part. For considerations specific to the domain name | |||
| part, see [RFC3491]. For the path part, administrators of sites | part, see [RFC3491]. For the path part, administrators of sites which | |||
| which allow independent users to create resources in the same subarea | allow independent users to create resources in the same subarea may | |||
| may need to be careful to check for spoofing. | need to be careful to check for spoofing. | |||
| Spoofing can occur because in the UCS, there are many characters that | Spoofing can occur because in the UCS, there are many characters that | |||
| look very similar. Details are discussed in Section 7.5. Again, | look very similar. Details are discussed in Section 7.5. Again, this | |||
| this is very similar to spoofing possibilities on US-ASCII, e.g. | is very similar to spoofing possibilities on US-ASCII, e.g. using | |||
| using 'br0ken' or '1ame' URIs. | 'br0ken' or '1ame' URIs. | |||
| Spoofing can occur when URIs in various encodings are accepted to | Spoofing can occur when URIs with percent-encodings based on various | |||
| deal with older user agents. In some cases, in particular for Latin- | character encodings are accepted to deal with older user agents. In | |||
| based resource names, this is usually easy to detect because UTF-8- | some cases, in particular for Latin-based resource names, this is | |||
| encoded names, when interpreted and viewed as legacy encodings, | usually easy to detect because UTF-8-encoded names, when interpreted | |||
| produce mostly garbage. In other cases, when concurrently used | and viewed as legacy character encodings, produce mostly garbage. In | |||
| encodings have a similar structure, but there are no characters that | other cases, when concurrently used character encodings have a | |||
| have exactly the same encoding, detection is more difficult. | similar structure, but there are no characters that have exactly the | |||
| same encoding, detection is more difficult. | ||||
| Spoofing can occur with bidirectional IRIs, if the restrictions in | Spoofing can occur with bidirectional IRIs, if the restrictions in | |||
| Section 4.2 are not followed. The same visual representation may be | Section 4.2 are not followed. The same visual representation may be | |||
| interpreted as different logical representations, and vice versa. It | interpreted as different logical representations, and vice versa. It | |||
| is also very important that a correct Unicode bidirectional | is also very important that a correct Unicode bidirectional | |||
| implementation is used. | implementation is used. | |||
| 9. Acknowledgements | 9. Acknowledgements | |||
| We would like to thank Larry Masinter for his work as coauthor of | We would like to thank Larry Masinter for his work as coauthor of | |||
| many earlier versions of this document (draft-masinter-url-i18n-xx). | many earlier versions of this document (draft-masinter-url-i18n-xx). | |||
| The discussion on the issue addressed here has started a long time | The discussion on the issue addressed here has started a long time | |||
| ago. There was a thread in the HTML working group in August 1995 | ago. There was a thread in the HTML working group in August 1995 | |||
| (under the topic of "Globalizing URIs") and in the www-international | (under the topic of "Globalizing URIs") and in the www-international | |||
| mailing list in July 1996 (under the topic of "Internationalization | mailing list in July 1996 (under the topic of "Internationalization | |||
| and URLs"), and ad-hoc meetings at the Unicode conferences in | and URLs"), and ad-hoc meetings at the Unicode conferences in | |||
| September 1995 and September 1997. | September 1995 and September 1997. | |||
| Thanks to Francois Yergeau, Matti Allouche, Roy Fielding, Tim | Many thanks go to Francois Yergeau, Matitiahu Allouche, Roy Fielding, | |||
| Berners-Lee, Mark Davis, M.T. Carrasco Benitez, James Clark, Tim | Tim Berners-Lee, Mark Davis, M.T. Carrasco Benitez, James Clark, Tim | |||
| Bray, Chris Wendt, Yaron Goland, Andrea Vine, Misha Wolf, Leslie | Bray, Chris Wendt, Yaron Goland, Andrea Vine, Misha Wolf, Leslie | |||
| Daigle, Ted Hardie, Makoto MURATA, Steven Atkin, Ryan Stansifer, Tex | Daigle, Ted Hardie, Makoto MURATA, Steven Atkin, Ryan Stansifer, Tex | |||
| Texin, Graham Klyne, Bjoern Hoehrmann, Chris Lilley, Ian Jacobs, Adam | Texin, Graham Klyne, Bjoern Hoehrmann, Chris Lilley, Ian Jacobs, Adam | |||
| Costello, Dan Oscarson, Elliotte Rusty Harold, Mike J. Brown, Andrea | Costello, Dan Oscarson, Elliotte Rusty Harold, Mike J. Brown, Andrea | |||
| Vine, Roy Badami, Simon Josefsson, Carlos Viegas Damasio, and many | Vine, Roy Badami, Jonathan Rosenne, Asmus Freytag, Simon Josefsson, | |||
| Carlos Viegas Damasio, Chris Haynes, Walter Underwood, and many | ||||
| others for help with understanding the issues and possible solutions, | others for help with understanding the issues and possible solutions, | |||
| and getting the details right. Thanks also to the members of the W3C | and getting the details right. Thanks also to the members of the W3C | |||
| I18N Working Group and Interest Group for their contributions and | I18N Working Group and Interest Group for their contributions and | |||
| their work on [CharMod], to the members of many other W3C WGs for | their work on [CharMod], to the members of many other W3C WGs for | |||
| adopting the ideas, and to the members of the Montreal IAB Workshop | adopting IRIs, and to the members of the Montreal IAB Workshop on | |||
| on Internationalization and Localization for their review. | Internationalization and Localization for their review. | |||
| Normative References | 10. References | |||
| [ISO10646] International Organization for Standardization, | 10.1 Normative References | |||
| [ISO10646] | ||||
| International Organization for Standardization, | ||||
| "Information Technology - Universal Multiple-Octet Coded | "Information Technology - Universal Multiple-Octet Coded | |||
| Character Set (UCS) - Part 1: Architecture and Basic | Character Set (UCS) - Part 1: Architecture and Basic | |||
| Multilingual Plane - Part 2: Supplementary Planes", ISO | Multilingual Plane - Part 2: Supplementary Planes", ISO | |||
| Standard 10646, with amendment, July 2002. | Standard 10646, with amendment, July 2002. | |||
| [RFC2234] Crocker, D. and P. Overell, "Augmented BNF for Syntax | [RFC2234] Crocker, D. and P. Overell, "Augmented BNF for Syntax | |||
| Specifications: ABNF", RFC 2234, November 1997. | Specifications: ABNF", RFC 2234, November 1997. | |||
| [RFC3490] Faltstrom, P., Hoffman, P. and A. Costello, | [RFC3490] Faltstrom, P., Hoffman, P. and A. Costello, | |||
| "Internationalizing Domain Names in Applications (IDNA)", | "Internationalizing Domain Names in Applications (IDNA)", | |||
| skipping to change at page 32, line 32 | skipping to change at page 34, line 17 | |||
| [RFC3491] Hoffman, P. and M. Blanchet, "Nameprep: A Stringprep | [RFC3491] Hoffman, P. and M. Blanchet, "Nameprep: A Stringprep | |||
| Profile for Internationalized Domain Names (IDN)", RFC | Profile for Internationalized Domain Names (IDN)", RFC | |||
| 3491, March 2003. | 3491, March 2003. | |||
| [RFC3629] Yergeau, F., "UTF-8, a transformation format of ISO | [RFC3629] Yergeau, F., "UTF-8, a transformation format of ISO | |||
| 10646", STD 63, RFC 3629, November 2003, <http:// | 10646", STD 63, RFC 3629, November 2003, <http:// | |||
| www.ietf.org/rfc/rfc3629.txt>. | www.ietf.org/rfc/rfc3629.txt>. | |||
| [RFCYYYY] Berners-Lee, T., Fielding, R. and L. Masinter, "Uniform | [RFCYYYY] Berners-Lee, T., Fielding, R. and L. Masinter, "Uniform | |||
| Resource Identifier (URI): Generic Syntax", draft- | Resource Identifier (URI): Generic Syntax", | |||
| fielding-uri-rfc2396bis-03.txt (work in progress), June | draft-fielding-uri-rfc2396bis-03.txt (work in progress), | |||
| 2003. | June 2003. | |||
| [UTR15] Davis, M. and M. Duerst, "Unicode Normalization Forms", | [UTR15] Davis, M. and M. Duerst, "Unicode Normalization Forms", | |||
| Unicode Standard Annex #15, March 2001, <http:// | Unicode Standard Annex #15, March 2001, <http:// | |||
| www.unicode.org/unicode/reports/tr15/tr15-21.html>. | www.unicode.org/unicode/reports/tr15/tr15-21.html>. | |||
| Non-normative References | 10.2 Non-normative References | |||
| [BidiEx] "Examples of bidirectional IRIs", <http://www.w3.org/ | [BidiEx] "Examples of bidirectional IRIs", <http://www.w3.org/ | |||
| International/iri-edit/BidiExamples>. | International/iri-edit/BidiExamples>. | |||
| [CharMod] Duerst, M., Yergeau, F., Ishida, R., Wolf, M. and T. | [CharMod] Duerst, M., Yergeau, F., Ishida, R., Wolf, M. and T. | |||
| Texin, "Character Model for the World Wide Web", | Texin, "Character Model for the World Wide Web", World | |||
| World Wide Web Consortium Working Draft, August 2003, | Wide Web Consortium Working Draft, August 2003, <http:// | |||
| <http://www.w3.org/TR/charmod>. | www.w3.org/TR/charmod>. | |||
| [Duerst97] Duerst, M., "The Properties and Promises of UTF-8", | [Duerst01] | |||
| Proc. 11th International Unicode Conference, San Jose | Duerst, M., "Internationalized Resource Identifiers: From | |||
| , September 1997, <http://www.ifi.unizh.ch/mml/ | Specification to Testing", Proc. 19th International | |||
| mduerst/papers/PDF/IUC11-UTF-8.pdf>. | Unicode Conference, San Jose , September 2001, <http:// | |||
| www.w3.org/2001/Talks/0912-IUC-IRI/paper.html>. | ||||
| [Duerst01] Duerst, M., "Internationalized Resource Identifiers: | [Duerst97] | |||
| From Specification to Testing", Proc. 19th | Duerst, M., "The Properties and Promises of UTF-8", Proc. | |||
| International Unicode Conference, San Jose , | 11th International Unicode Conference, San Jose , | |||
| September 2001, <http://www.w3.org/2001/Talks/0912- | September 1997, <http://www.ifi.unizh.ch/mml/mduerst/ | |||
| IUC-IRI/paper.html>. | papers/PDF/IUC11-UTF-8.pdf>. | |||
| [Gettys] Gettys, J., "URI Model Consequences", <http:// | [Gettys] Gettys, J., "URI Model Consequences", <http://www.w3.org/ | |||
| www.w3.org/DesignIssues/ModelConsequences>. | DesignIssues/ModelConsequences>. | |||
| [HTML4] Raggett, D., Le Hors, A. and I. Jacobs, "HTML 4.01 | [HTML4] Raggett, D., Le Hors, A. and I. Jacobs, "HTML 4.01 | |||
| Specification", World Wide Web Consortium | Specification", World Wide Web Consortium Recommendation, | |||
| Recommendation, December 1999, <http://www.w3.org/TR/ | December 1999, <http://www.w3.org/TR/REC-html40/appendix/ | |||
| REC-html40/appendix/notes.html#h-B.2>. | notes.html#h-B.2>. | |||
| [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate | [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate | |||
| Requirement Levels", BCP 14, RFC 2119, March 1997. | Requirement Levels", BCP 14, RFC 2119, March 1997. | |||
| [RFC2130] Weider, C., Preston, C., Simonsen, K., Alvestrand, | [RFC2130] Weider, C., Preston, C., Simonsen, K., Alvestrand, H., | |||
| H., Atkinson, R., Crispin, M. and P. Svanberg, "The | Atkinson, R., Crispin, M. and P. Svanberg, "The Report of | |||
| Report of the IAB Character Set Workshop held 29 | the IAB Character Set Workshop held 29 February - 1 March, | |||
| February - 1 March, 1996", RFC 2130, April 1997. | 1996", RFC 2130, April 1997. | |||
| [RFC2141] Moats, R., "URN Syntax", RFC 2141, May 1997. | [RFC2141] Moats, R., "URN Syntax", RFC 2141, May 1997. | |||
| [RFC2192] Newman, C., "IMAP URL Scheme", RFC 2192, September | [RFC2192] Newman, C., "IMAP URL Scheme", RFC 2192, September 1997. | |||
| 1997. | ||||
| [RFC2277] Alvestrand, H., "IETF Policy on Character Sets and | [RFC2277] Alvestrand, H., "IETF Policy on Character Sets and | |||
| Languages", BCP 18, RFC 2277, January 1998. | Languages", BCP 18, RFC 2277, January 1998. | |||
| [RFC2384] Gellens, R., "POP URL Scheme", RFC 2384, August 1998. | [RFC2384] Gellens, R., "POP URL Scheme", RFC 2384, August 1998. | |||
| [RFC2396] Berners-Lee, T., Fielding, R. and L. Masinter, | [RFC2396] Berners-Lee, T., Fielding, R. and L. Masinter, "Uniform | |||
| "Uniform Resource Identifiers (URI): Generic Syntax", | Resource Identifiers (URI): Generic Syntax", RFC 2396, | |||
| RFC 2396, August 1998. | ||||
| [RFC2397] Masinter, L., "The "data" URL scheme", RFC 2397, | ||||
| August 1998. | August 1998. | |||
| [RFC2397] Masinter, L., "The "data" URL scheme", RFC 2397, August | ||||
| 1998. | ||||
| [RFC2616] Fielding, R., Gettys, J., Mogul, J., Nielsen, H., | [RFC2616] Fielding, R., Gettys, J., Mogul, J., Nielsen, H., | |||
| Masinter, L., Leach, P. and T. Berners-Lee, | Masinter, L., Leach, P. and T. Berners-Lee, "Hypertext | |||
| "Hypertext Transfer Protocol -- HTTP/1.1", RFC 2616, | Transfer Protocol -- HTTP/1.1", RFC 2616, June 1999. | |||
| June 1999. | ||||
| [RFC2640] Curtin, B., "Internationalization of the File | [RFC2640] Curtin, B., "Internationalization of the File Transfer | |||
| Transfer Protocol", RFC 2640, July 1999. | Protocol", RFC 2640, July 1999. | |||
| [RFC2718] Masinter, L., Alvestrand, H., Zigmond, D. and R. | [RFC2718] Masinter, L., Alvestrand, H., Zigmond, D. and R. Petke, | |||
| Petke, "Guidelines for new URL Schemes", RFC 2718, | "Guidelines for new URL Schemes", RFC 2718, November 1999. | |||
| November 1999. | ||||
| [UNIV4] The Unicode Consortium, "The Unicode Standard, | [UNI9] Davis, M., "The Bidirectional Algorithm", Unicode Standard | |||
| Version 4.0", Addison-Wesley, Reading, MA , 2003. | Annex #9, March 2002, <http://www.unicode.org/unicode/ | |||
| reports/tr9>. | ||||
| [UNI9] Davis, M., "The Bidirectional Algorithm", Unicode | [UNIV4] The Unicode Consortium, "The Unicode Standard, Version | |||
| Standard Annex #9, March 2002, <http:// | 4.0", Addison-Wesley, Reading, MA , 2003. | |||
| www.unicode.org/unicode/reports/tr9>. | ||||
| [UNIXML] Duerst, M. and A. Freytag, "Unicode in XML and other | [UNIXML] Duerst, M. and A. Freytag, "Unicode in XML and other | |||
| Markup Languages", Unicode Technical Report #20, | Markup Languages", Unicode Technical Report #20, World | |||
| World Wide Web Consortium Note, February 2002, | Wide Web Consortium Note, February 2002, <http:// | |||
| <http://www.w3.org/TR/unicode-xml/>. | www.w3.org/TR/unicode-xml/>. | |||
| [W3CIRI] Duerst, M., "Internationalization - URIs and other | [W3CIRI] Duerst, M., "Internationalization - URIs and other | |||
| identifiers", World Wide Web Consortium Note, | identifiers", World Wide Web Consortium Note, September | |||
| September 2002, <http://www.w3.org/International/O- | 2002, <http://www.w3.org/International/ | |||
| URL-and-ident.html>. | O-URL-and-ident.html>. | |||
| [XLink] DeRose, S., Maler, E. and D. Orchard, "XML Linking | [XLink] DeRose, S., Maler, E. and D. Orchard, "XML Linking | |||
| Language (XLink) Version 1.0", World Wide Web | Language (XLink) Version 1.0", World Wide Web Consortium | |||
| Consortium Recommendation, June 2001, <http:// | Recommendation, June 2001, <http://www.w3.org/TR/xlink/ | |||
| www.w3.org/TR/xlink/#link-locators>. | #link-locators>. | |||
| [XML1] Bray, T., Paoli, J., Sperberg-McQueen, C. and E. | [XML1] Bray, T., Paoli, J., Sperberg-McQueen, C. and E. Maler, | |||
| Maler, "Extensible Markup Language (XML) 1.0 (Second | "Extensible Markup Language (XML) 1.0 (Second Edition)", | |||
| Edition)", World Wide Web Consortium Recommendation, | World Wide Web Consortium Recommendation, including | |||
| including Erratum 26 at http://www.w3.org/XML/xml- | Erratum 26 at http://www.w3.org/XML/xml-V10-2e-errata#E26, | |||
| V10-2e-errata#E26, October 2000, <http://www.w3.org/ | October 2000, <http://www.w3.org/TR/ | |||
| TR/REC-xml#sec-external-ent>. | REC-xml#sec-external-ent>. | |||
| [XMLNamespace] Bray, T., Hollander, D. and A. Layman, "Namespaces in | [XMLNamespace] | |||
| XML", World Wide Web Consortium Recommendation, | Bray, T., Hollander, D. and A. Layman, "Namespaces in | |||
| January 1999, <http://www.w3.org/TR/REC-xml#sec- | XML", World Wide Web Consortium Recommendation, January | |||
| external-ent>. | 1999, <http://www.w3.org/TR/REC-xml#sec-external-ent>. | |||
| [XMLSchema] Biron, P. and A. Malhotra, "XML Schema Part 2: | [XMLSchema] | |||
| Datatypes", World Wide Web Consortium Recommendation, | Biron, P. and A. Malhotra, "XML Schema Part 2: Datatypes", | |||
| May 2001, <http://www.w3.org/TR/xmlschema-2/#anyURI>. | World Wide Web Consortium Recommendation, May 2001, | |||
| <http://www.w3.org/TR/xmlschema-2/#anyURI>. | ||||
| [XPointer] Grosso, P., Maler, E., Marsh, J. and N. Walsh, | [XPointer] | |||
| "XPointer Framework", World Wide Web Consortium | Grosso, P., Maler, E., Marsh, J. and N. Walsh, "XPointer | |||
| Recommendation, March 2003, <http://www.w3.org/TR/ | Framework", World Wide Web Consortium Recommendation, | |||
| xptr-framework/#escaping>. | March 2003, <http://www.w3.org/TR/xptr-framework/ | |||
| #escaping>. | ||||
| Authors' Addresses | Authors' Addresses | |||
| Martin Duerst (Note: Please write "Duerst" with u-umlaut wherever | Martin Duerst (Note: Please write "Duerst" with u-umlaut wherever | |||
| possible, for example as "Dürst in XML and HTML.) | possible, for example as "Dürst" in XML and HTML.) | |||
| World Wide Web Consortium | World Wide Web Consortium | |||
| 200 Technology Square | 5322 Endo | |||
| Cambridge, MA 02139 | Fujisawa, Kanagawa 252-8520 | |||
| U.S.A. | Japan | |||
| Phone: +1 617 253 5509 | Phone: +81 466 49 1170 | |||
| Fax: +1 617 258 5999 | Fax: +81 466 49 1171 | |||
| EMail: mailto:duerst@w3.org | EMail: mailto:duerst@w3.org | |||
| URI: http://www.w3.org/People/D%C3%BCrst/ | URI: http://www.w3.org/People/D%C3%BCrst/ | |||
| (Note: This is the escaped form of an IRI.) | (Note: This is the percent-encoded form of an IRI.) | |||
| Michel Suignard | Michel Suignard | |||
| Microsoft Corporation | Microsoft Corporation | |||
| One Microsoft Way | One Microsoft Way | |||
| Redmond, WA 98052 | Redmond, WA 98052 | |||
| U.S.A. | U.S.A. | |||
| Phone: +1 425 882-8080 | Phone: +1 425 882-8080 | |||
| EMail: mailto:michelsu@microsoft.com | EMail: mailto:michelsu@microsoft.com | |||
| URI: http://www.suignard.com | URI: http://www.suignard.com | |||
| Full Copyright Statement | Appendix A. Design Alternatives | |||
| Copyright (C) The Internet Society (2004). All Rights Reserved. | This section briefly summarizes major design alternatives and the | |||
| reasons why they were not chosen. | ||||
| This document and translations of it may be copied and furnished to | Appendix A.1 New Scheme(s) | |||
| others, and derivative works that comment on or otherwise explain it | ||||
| or assist in its implementation may be prepared, copied, published | ||||
| and distributed, in whole or in part, without restriction of any | ||||
| kind, provided that the above copyright notice and this paragraph are | ||||
| included on all such copies and derivative works. However, this | ||||
| document itself may not be modified in any way, such as by removing | ||||
| the copyright notice or references to the Internet Society or other | ||||
| Internet organizations, except as needed for the purpose of | ||||
| developing Internet standards in which case the procedures for | ||||
| copyrights defined in the Internet Standards process must be | ||||
| followed, or as required to translate it into languages other than | ||||
| English. | ||||
| The limited permissions granted above are perpetual and will not be | Introducing new schemes (for example httpi:, ftpi:,...) or a new | |||
| revoked by the Internet Society or its successors or assigns. | metascheme (e.g. i:, leading to URI/IRI prefixes such as i:http:, | |||
| i:ftp:,...) was proposed to make IRI-to-URI conversion | ||||
| scheme-dependent or to distinguish between percent-encodings | ||||
| resulting from IRI-to-URI conversion and percent-encodings from | ||||
| legacy character encodings. | ||||
| This document and the information contained herein is provided on an | New schemes are not needed to distinguish URIs from true IRIs (i.e. | |||
| "AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET ENGINEERING | IRIs that contain non-ASCII characters). The benefit of being able to | |||
| TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING | detect the origin of percent-encodings is marginal, also because | |||
| BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION | UTF-8 can be detected with very high reliability. Deploying new schemes | |||
| HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF | is extremely hard. Not needing new schemes for IRIs makes deployment | |||
| MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. | of IRIs vastly easier. Making conversion scheme-dependent is highly | |||
| unadvisable. Using a uniform convention for conversion from IRIs to | ||||
| URIs makes IRI implementation orthogonal to the introduction of | ||||
| actual new schemes. | ||||
| Acknowledgement | Appendix A.2 Other Character Encodings than UTF-8 | |||
| At an early stage, UTF-7 was considered as an alternative to UTF-8 | ||||
| when converting IRIs to URIs. UTF-7 would not have needed | ||||
| percent-encoding, and would in most cases have been shorter than | ||||
| percent-encoded UTF-8. | ||||
| UTF-8 avoids a double layering and overloading of the use of the "+" | ||||
| character. UTF-8 is fully compatible with US-ASCII, and has therefore | ||||
| been recommended by the IETF, and is being used widely, while UTF-7 | ||||
| has never been used much and is now clearly being discouraged. | ||||
| Appendix A.3 New Encoding Convention | ||||
| Instead of using the existing percent-encoding convention of URIs, | ||||
| which is based on octets, the idea was to create a new encoding | ||||
| convention, for example to use '%u' to introduce UCS code points. | ||||
| Using the existing octet-based percent-encoding mechanism does not | ||||
| need an upgrade of the URI syntax, and does not need corresponding | ||||
| server upgrades. | ||||
| Appendix A.4 Indicating Character Encodings in the URI/IRI | ||||
| Some proposals suggested indicating the character encodings used in | ||||
| a URI or IRI with some new syntactic convention in the URI itself, | ||||
| similar to the 'charset' parameter for emails and Web pages. As an | ||||
| example, the label in square brackets in http://www.example.org/ | ||||
| ros[iso-8859-1]é indicated that the following é had to be | ||||
| interpreted as iso-8859-1. | ||||
| Using UTF-8 only does not need an upgrade to the URI syntax. It | ||||
| avoids potentially multiple labels that have to be copied correctly | ||||
| in all cases, even on the side of a bus or on a napkin, leading to | ||||
| usability problems to the extent of being prohibitively annoying. | ||||
| Using UTF-8 only also reduces transcoding errors and confusions. | ||||
| Intellectual Property Statement | ||||
| The IETF takes no position regarding the validity or scope of any | ||||
| Intellectual Property Rights or other rights that might be claimed to | ||||
| pertain to the implementation or use of the technology described in | ||||
| this document or the extent to which any license under such rights | ||||
| might or might not be available; nor does it represent that it has | ||||
| made any independent effort to identify any such rights. Information | ||||
| on the IETF's procedures with respect to rights in IETF Documents can | ||||
| be found in BCP 78 and BCP 79. | ||||
| Copies of IPR disclosures made to the IETF Secretariat and any | ||||
| assurances of licenses to be made available, or the result of an | ||||
| attempt made to obtain a general license or permission for the use of | ||||
| such proprietary rights by implementers or users of this | ||||
| specification can be obtained from the IETF on-line IPR repository at | ||||
| http://www.ietf.org/ipr. | ||||
| The IETF invites any interested party to bring to its attention any | ||||
| copyrights, patents or patent applications, or other proprietary | ||||
| rights that may cover technology that may be required to implement | ||||
| this standard. Please address the information to the IETF at | ||||
| ietf-ipr@ietf.org. | ||||
| Disclaimer of Validity | ||||
| This document and the information contained herein are provided on an | ||||
| "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS | ||||
| OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY AND THE INTERNET | ||||
| ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS OR IMPLIED, | ||||
| INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE | ||||
| INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED | ||||
| WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. | ||||
| Copyright Statement | ||||
| Copyright (C) The Internet Society (2004). This document is subject | ||||
| to the rights, licenses and restrictions contained in BCP 78, and | ||||
| except as set forth therein, the authors retain all their rights. | ||||
| Acknowledgment | ||||
| Funding for the RFC Editor function is currently provided by the | Funding for the RFC Editor function is currently provided by the | |||
| Internet Society. | Internet Society. | |||
| End of changes. | ||||
This html diff was produced by rfcdiff 1.12, available from http://www.levkowetz.com/ietf/tools/rfcdiff/
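
Two points in the revised text lend themselves to a small illustration: the note in the authors' addresses that http://www.w3.org/People/D%C3%BCrst/ is the percent-encoded form of an IRI, and the remark in Appendix A.1 that UTF-8 can be detected very reliably in percent-encoded octets. The following Python sketch is not part of either draft. The function names `iri_path_to_uri_path` and `percent_octets_look_like_utf8` are hypothetical, the set of characters left unencoded is a simplifying assumption, and only a path component is handled (domain names need IDNA processing instead).

```python
from urllib.parse import quote, unquote_to_bytes

# Characters left untouched, in addition to ASCII letters, digits and "_.-~",
# which quote() never encodes. Keeping "%" preserves existing
# percent-encodings. This set is an assumption made for the sketch.
SAFE = "/%:@!$&'()*+,;="


def iri_path_to_uri_path(path: str) -> str:
    """Encode each non-ASCII character as UTF-8, then percent-encode the octets."""
    return quote(path, safe=SAFE, encoding="utf-8")


def percent_octets_look_like_utf8(component: str) -> bool:
    """Return True if the octets behind the percent-encodings decode as UTF-8.

    Pure US-ASCII input trivially passes, so a True result only says the
    octets are consistent with UTF-8, not that UTF-8 was actually intended.
    """
    raw = unquote_to_bytes(component)
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


if __name__ == "__main__":
    print(iri_path_to_uri_path("/People/Dürst/"))
    # Octets from UTF-8 versus octets from a legacy encoding (ISO-8859-1).
    print(percent_octets_look_like_utf8("D%C3%BCrst"))
    print(percent_octets_look_like_utf8("D%FCrst"))
```

The check relies on the fact that a lone octet such as FC is not a valid UTF-8 sequence, which is why Latin-based names in legacy encodings are usually easy to tell apart from UTF-8. It does not help in the harder case the security section describes, where concurrently used encodings have a similar structure.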