| draft-duerst-iri-03.txt | draft-duerst-iri-04.txt | |||
|---|---|---|---|---|
| Network Working Group M. Duerst | Network Working Group M. Duerst | |||
| Internet-Draft W3C | Internet-Draft W3C | |||
| Expires: August 31, 2003 M. Suignard | Expires: December 28, 2003 M. Suignard | |||
| Microsoft Corporation | Microsoft Corporation | |||
| March 2, 2003 | June 29, 2003 | |||
| Internationalized Resource Identifiers (IRIs) | Internationalized Resource Identifiers (IRIs) | |||
| draft-duerst-iri-03 | draft-duerst-iri-04 | |||
| Status of this Memo | Status of this Memo | |||
| This document is an Internet-Draft and is in full conformance with | This document is an Internet-Draft and is in full conformance with | |||
| all provisions of Section 10 of RFC2026. | all provisions of Section 10 of RFC2026. | |||
| Internet-Drafts are working documents of the Internet Engineering | Internet-Drafts are working documents of the Internet Engineering | |||
| Task Force (IETF), its areas, and its working groups. Note that | Task Force (IETF), its areas, and its working groups. Note that | |||
| other groups may also distribute working documents as Internet- | other groups may also distribute working documents as Internet- | |||
| Drafts. | Drafts. | |||
| skipping to change at page 1, line 33 | skipping to change at page 1, line 33 | |||
| and may be updated, replaced, or obsoleted by other documents at any | and may be updated, replaced, or obsoleted by other documents at any | |||
| time. It is inappropriate to use Internet-Drafts as reference | time. It is inappropriate to use Internet-Drafts as reference | |||
| material or to cite them other than as "work in progress." | material or to cite them other than as "work in progress." | |||
| The list of current Internet-Drafts can be accessed at http:// | The list of current Internet-Drafts can be accessed at http:// | |||
| www.ietf.org/ietf/1id-abstracts.txt. | www.ietf.org/ietf/1id-abstracts.txt. | |||
| The list of Internet-Draft Shadow Directories can be accessed at | The list of Internet-Draft Shadow Directories can be accessed at | |||
| http://www.ietf.org/shadow.html. | http://www.ietf.org/shadow.html. | |||
| This Internet-Draft will expire on August 31, 2003. | This Internet-Draft will expire on December 28, 2003. | |||
| Copyright Notice | Copyright Notice | |||
| Copyright (C) The Internet Society (2003). All Rights Reserved. | Copyright (C) The Internet Society (2003). All Rights Reserved. | |||
| Abstract | Abstract | |||
| This document defines a new protocol element, the Internationalized | This document defines a new protocol element, the Internationalized | |||
| Resource Identifier (IRI), as a complement to the URI [RFC2396]. An | Resource Identifier (IRI), as a complement to the URI [RFCYYYY]. An | |||
| IRI is a sequence of characters from the Universal Character Set | IRI is a sequence of characters from the Universal Character Set | |||
| [ISO10646]. A mapping from IRIs to URIs is defined, which means that | [ISO10646]. A mapping from IRIs to URIs is defined, which means that | |||
| IRIs can be used instead of URIs where appropriate to identify | IRIs can be used instead of URIs where appropriate to identify | |||
| resources. | resources. | |||
| The approach of defining a new protocol element was chosen, instead | The approach of defining a new protocol element was chosen, instead | |||
| of extending or changing the definition of URIs, to allow a clear | of extending or changing the definition of URIs, to allow a clear | |||
| distinction and to avoid incompatibilities with existing software. | distinction and to avoid incompatibilities with existing software. | |||
| Guidelines for the use and deployment of IRIs in various protocols, | Guidelines for the use and deployment of IRIs in various protocols, | |||
| formats, and software components that now deal with URIs are | formats, and software components that now deal with URIs are | |||
| provided. | provided. | |||
| NOTE | NOTE | |||
| This document is a product of the Internationalization Working Group | This document is a product of the Internationalization Working Group | |||
| (I18N WG) of the World Wide Web Consortium (W3C). For general | (I18N WG) of the World Wide Web Consortium (W3C). For general | |||
| discussion, please use the www-international@w3.org mailing list | discussion, please use the public-iri@w3.org mailing list (publicly | |||
| (publicly archived at http://lists.w3.org/Archives/Public/www- | archived at http://lists.w3.org/Archives/Public/public-iri/). An | |||
| international/). For more information on the topic of this document, | issues list for this document is maintained at http://www.w3.org/ | |||
| please also see [W3CIRI] and [Duerst01]. | International/iri-edit#issues. For more information on the topic of | |||
| this document, please also see [W3CIRI] and [Duerst01]. | ||||
| Table of Contents | Table of Contents | |||
| 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 4 | 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 4 | |||
| 1.1 Overview and Motivation . . . . . . . . . . . . . . . . . . 4 | 1.1 Overview and Motivation . . . . . . . . . . . . . . . . . . 4 | |||
| 1.2 Applicability . . . . . . . . . . . . . . . . . . . . . . . 4 | 1.2 Applicability . . . . . . . . . . . . . . . . . . . . . . . 4 | |||
| 1.3 Definitions . . . . . . . . . . . . . . . . . . . . . . . . 5 | 1.3 Definitions . . . . . . . . . . . . . . . . . . . . . . . . 5 | |||
| 1.4 Notation . . . . . . . . . . . . . . . . . . . . . . . . . . 6 | 1.4 Notation . . . . . . . . . . . . . . . . . . . . . . . . . . 6 | |||
| 2. IRI Syntax . . . . . . . . . . . . . . . . . . . . . . . . . 7 | 2. IRI Syntax . . . . . . . . . . . . . . . . . . . . . . . . . 6 | |||
| 2.1 Summary of IRI Syntax . . . . . . . . . . . . . . . . . . . 7 | 2.1 Summary of IRI Syntax . . . . . . . . . . . . . . . . . . . 7 | |||
| 2.2 ABNF for IRI References and IRIs . . . . . . . . . . . . . . 7 | 2.2 ABNF for IRI References and IRIs . . . . . . . . . . . . . . 7 | |||
| 2.3 IRI Equivalence and Normalization . . . . . . . . . . . . . 10 | 3. Relationship between IRIs and URIs . . . . . . . . . . . . . 10 | |||
| 3. Relationship between IRIs and URIs . . . . . . . . . . . . . 11 | 3.1 Mapping of IRIs to URIs . . . . . . . . . . . . . . . . . . 10 | |||
| 3.1 Mapping of IRIs to URIs . . . . . . . . . . . . . . . . . . 12 | 3.2 Converting URIs to IRIs . . . . . . . . . . . . . . . . . . 12 | |||
| 3.2 Converting URIs to IRIs . . . . . . . . . . . . . . . . . . 14 | 3.2.1 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . 14 | |||
| 3.2.1 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . 15 | 4. Bidirectional IRIs for Right-to-left Languages . . . . . . . 15 | |||
| 4. Bidirectional IRIs for Right-to-left Languages . . . . . . . 16 | 4.1 Logical Storage and Visual Presentation . . . . . . . . . . 16 | |||
| 4.1 Logical Storage and Visual Presentation . . . . . . . . . . 17 | 4.2 Bidi IRI Structure . . . . . . . . . . . . . . . . . . . . . 16 | |||
| 4.2 Bidi IRI Structure . . . . . . . . . . . . . . . . . . . . . 17 | 4.3 Input of Bidi IRIs . . . . . . . . . . . . . . . . . . . . . 17 | |||
| 4.3 Input of Bidi IRIs . . . . . . . . . . . . . . . . . . . . . 18 | ||||
| 4.4 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . 18 | 4.4 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . 18 | |||
| 5. Use of IRIs . . . . . . . . . . . . . . . . . . . . . . . . 20 | 5. IRI Equivalence and Comparison . . . . . . . . . . . . . . . 19 | |||
| 5.1 Limitations on UCS Characters Allowed in IRIs . . . . . . . 20 | 5.1 Simple String Comparison . . . . . . . . . . . . . . . . . . 20 | |||
| 5.2 Software Interfaces and Protocols . . . . . . . . . . . . . 21 | 5.2 Conversion to URIs . . . . . . . . . . . . . . . . . . . . . 20 | |||
| 5.3 Format of URIs and IRIs in Documents and Protocols . . . . . 21 | 5.3 Normalization . . . . . . . . . . . . . . . . . . . . . . . 20 | |||
| 5.4 Relative IRI References . . . . . . . . . . . . . . . . . . 22 | 5.4 Preferred Forms . . . . . . . . . . . . . . . . . . . . . . 21 | |||
| 6. URI/IRI Processing Guidelines (informative) . . . . . . . . 22 | 6. Use of IRIs . . . . . . . . . . . . . . . . . . . . . . . . 22 | |||
| 6.1 URI/IRI Software Interfaces . . . . . . . . . . . . . . . . 22 | 6.1 Limitations on UCS Characters Allowed in IRIs . . . . . . . 22 | |||
| 6.2 URI/IRI Entry . . . . . . . . . . . . . . . . . . . . . . . 23 | 6.2 Software Interfaces and Protocols . . . . . . . . . . . . . 22 | |||
| 6.3 URI/IRI Transfer Between Applications . . . . . . . . . . . 23 | 6.3 Format of URIs and IRIs in Documents and Protocols . . . . . 23 | |||
| 6.4 URI/IRI Generation . . . . . . . . . . . . . . . . . . . . . 24 | 6.4 Use of UTF-8 for Encoding Original Characters . . . . . . . 23 | |||
| 6.5 URI/IRI Selection . . . . . . . . . . . . . . . . . . . . . 24 | 6.5 Relative IRI References . . . . . . . . . . . . . . . . . . 24 | |||
| 6.6 Display of URIs/IRIs . . . . . . . . . . . . . . . . . . . . 25 | 7. URI/IRI Processing Guidelines (informative) . . . . . . . . 24 | |||
| 6.7 Interpretation of URIs and IRIs . . . . . . . . . . . . . . 25 | 7.1 URI/IRI Software Interfaces . . . . . . . . . . . . . . . . 24 | |||
| 6.8 Upgrading Strategy . . . . . . . . . . . . . . . . . . . . . 26 | 7.2 URI/IRI Entry . . . . . . . . . . . . . . . . . . . . . . . 25 | |||
| 7. Security Considerations . . . . . . . . . . . . . . . . . . 27 | 7.3 URI/IRI Transfer Between Applications . . . . . . . . . . . 26 | |||
| 8. Issues List . . . . . . . . . . . . . . . . . . . . . . . . 28 | 7.4 URI/IRI Generation . . . . . . . . . . . . . . . . . . . . . 26 | |||
| 9. Change log . . . . . . . . . . . . . . . . . . . . . . . . . 28 | 7.5 URI/IRI Selection . . . . . . . . . . . . . . . . . . . . . 27 | |||
| 9.1 Changes from -02 to -03 . . . . . . . . . . . . . . . . . . 28 | 7.6 Display of URIs/IRIs . . . . . . . . . . . . . . . . . . . . 27 | |||
| 9.2 Changes from -01 to -02 . . . . . . . . . . . . . . . . . . 29 | 7.7 Interpretation of URIs and IRIs . . . . . . . . . . . . . . 28 | |||
| 9.3 Changes from -00 to -01 . . . . . . . . . . . . . . . . . . 29 | 7.8 Upgrading Strategy . . . . . . . . . . . . . . . . . . . . . 28 | |||
| 10. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . 29 | 8. Security Considerations . . . . . . . . . . . . . . . . . . 29 | |||
| Normative References . . . . . . . . . . . . . . . . . . . . 30 | 9. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . 30 | |||
| Non-normative References . . . . . . . . . . . . . . . . . . 31 | Normative References . . . . . . . . . . . . . . . . . . . . 31 | |||
| Authors' Addresses . . . . . . . . . . . . . . . . . . . . . 33 | Non-normative References . . . . . . . . . . . . . . . . . . 32 | |||
| Full Copyright Statement . . . . . . . . . . . . . . . . . . 34 | Authors' Addresses . . . . . . . . . . . . . . . . . . . . . 34 | |||
| Full Copyright Statement . . . . . . . . . . . . . . . . . . 35 | ||||
| 1. Introduction | 1. Introduction | |||
| 1.1 Overview and Motivation | 1.1 Overview and Motivation | |||
| A URI is defined in [RFC2396] as a sequence of characters chosen from | A URI is defined in [RFCYYYY] as a sequence of characters chosen from | |||
| a limited subset of the repertoire of US-ASCII characters. | a limited subset of the repertoire of US-ASCII characters. | |||
| The characters in URIs are frequently used for representing words of | The characters in URIs are frequently used for representing words of | |||
| natural languages. Such usage has many advantages: such URIs are | natural languages. Such usage has many advantages: such URIs are | |||
| easier to memorize, easier to interpret, easier to transcribe, easier | easier to memorize, easier to interpret, easier to transcribe, easier | |||
| to create, and easier to guess. For most languages other than | to create, and easier to guess. For most languages other than | |||
| English, however, the natural script uses characters other than A-Z. | English, however, the natural script uses characters other than A-Z. | |||
| For many people, handling Latin characters is as difficult as | For many people, handling Latin characters is as difficult as | |||
| handling the characters of other scripts is for people who use only | handling the characters of other scripts is for people who use only | |||
| the Latin alphabet. Many languages with non-Latin scripts do have | the Latin alphabet. Many languages with non-Latin scripts have | |||
| transcriptions to Latin letters and such transcriptions are now often | transcriptions to Latin letters. Such transcriptions are now often | |||
| used in URIs, but they introduce additional ambiguities. | used in URIs, but they introduce additional ambiguities. | |||
| The infrastructure for the appropriate handling of characters from | The infrastructure for the appropriate handling of characters from | |||
| local scripts is now widely deployed in local versions of operating | local scripts is now widely deployed in local versions of operating | |||
| system and application software. Software that can handle a wide | system and application software. Software that can handle a wide | |||
| variety of scripts and languages at the same time is increasingly | variety of scripts and languages at the same time is increasingly | |||
| widespread. Also, there are increasing numbers of protocols and | widespread. Also, there are increasing numbers of protocols and | |||
| formats that can carry a wide range of characters. | formats that can carry a wide range of characters. | |||
| This document defines a new protocol element, called IRI | This document defines a new protocol element, called IRI | |||
| (Internationalized Resource Identifier), by extending the syntax of | (Internationalized Resource Identifier), by extending the syntax of | |||
| URIs to a much wider repertoire of characters. It also defines | URIs to a much wider repertoire of characters. It also defines | |||
| "internationalized" versions corresponding to other constructs from | "internationalized" versions corresponding to other constructs from | |||
| [RFC2396], such as URI references. | [RFCYYYY], such as URI references. | |||
| Using characters outside of A-Z in IRIs brings with it some | Using characters outside of A-Z in IRIs brings with it some | |||
| difficulties; a discussion of potential problems and workarounds can | difficulties; a discussion of potential problems and workarounds can | |||
| be found in the later sections of this document. | be found in the later sections of this document. | |||
| 1.2 Applicability | 1.2 Applicability | |||
| IRIs are designed to be compatible with recent recommendations on URI | IRIs are designed to be compatible with recent recommendations for | |||
| syntax [RFC2718]. The compatibility is provided by providing a well | new URI schemes [RFC2718]. The compatibility is provided by | |||
| defined and deterministic mapping from the IRI character sequence to | providing a well defined and deterministic mapping from the IRI | |||
| the functionally equivalent URI character sequence. Practical use of | character sequence to the functionally equivalent URI character | |||
| IRIs (or IRI references) in place of URIs (or URI references) depends | sequence. Practical use of IRIs (or IRI references) in place of URIs | |||
| on the following conditions being met: | (or URI references) depends on the following conditions being met: | |||
| a) The protocol or format element used should be explicitly | a) The protocol or format element used should be explicitly | |||
| designated to carry IRIs. That is, the intent is not to | designated to carry IRIs. That is, the intent is not to | |||
| introduce IRIs into contexts that are not defined to accept | introduce IRIs into contexts that are not defined to accept | |||
| them. For example, XML schema [XMLSchema] has an explicit type | them. For example, XML schema [XMLSchema] has an explicit type | |||
| "anyURI" that designates the use of IRIs. | "anyURI" that designates the use of IRIs. | |||
| b) The protocol or format carrying the IRIs should have a | b) The protocol or format carrying the IRIs should have a | |||
| mechanism to represent the wide range of characters used in | mechanism to represent the wide range of characters used in | |||
| IRIs, either natively or by some protocol- or format-specific | IRIs, either natively or by some protocol- or format-specific | |||
| escaping mechanism (for example numeric character references in | escaping mechanism (for example numeric character references in | |||
| [XML1]). | [XML1]). | |||
| c) Either by definition for all the URIs of a specific URI scheme, | c) The URI corresponding to the IRI in question has to encode | |||
| or a specific part of a URI (Reference), such as the fragment | original characters into octets using UTF-8. For new URI | |||
| identifier, or at least for some specific URIs of a given | schemes, this is recommended in [RFC2718]. It can apply to a | |||
| scheme, the encoding of non-ASCII characters should be based on | whole scheme (e.g. IMAP URLs [RFC2192] and POP URLs [RFC2384], | |||
| UTF-8. For new URI schemes, this is recommended in [RFC2718]. | or the URN syntax [RFC2141]). It can apply to a specific part | |||
| This allows IRIs to be used with the URN syntax [RFC2141] as | of an URI, such as the fragment identifier (e.g. [XPointer]). | |||
| well as recent URL scheme definitions based on UTF-8, such as | It can apply to a specific URI or part(s) thereof. For | |||
| IMAP URLs [RFC2192] and POP URLs [RFC2384]. | details, please see Section 6.4. | |||
| In cases and for pieces where an encoding other than UTF-8 is used, | ||||
| and for raw binary data encoded in URIs (see [RFC2397]), the octets | ||||
| have to be %-escaped. In these situations, the ability of IRIs to | ||||
| directly represent a wide character repertoire cannot be used. | ||||
| For example, for a document with a URI of | ||||
| http://www.example.org/r%C3%A9sum%C3%A9.html, it is possible to | ||||
| construct a corresponding IRI (in XML notation, see Section 1.4): | ||||
| http://www.example.org/r&#xE9;sum&#xE9;.html (&#xE9; stands for the | ||||
| e-acute character, and %C3%A9 is the UTF-8 encoded and escaped | ||||
| representation of that character). On the other hand, for a document | ||||
| with an URI of http://www.example.org/r%E9sum%E9.html, the escaped | ||||
| octets cannot be converted to actual characters in an IRI, because | ||||
| the escaping is based on iso-8859-1 rather than UTF-8. | ||||
| 1.3 Definitions | 1.3 Definitions | |||
| The following definitions are used in this document; they follow the | The following definitions are used in this document; they follow the | |||
| terms in [RFC2130], [RFC2277] and [ISO10646]: | terms in [RFC2130], [RFC2277] and [ISO10646]: | |||
| character: A member of a set of elements used for the | character: A member of a set of elements used for the | |||
| organization, control, or representation of data. For example, | organization, control, or representation of data. For example, | |||
| "LATIN CAPITAL LETTER A" names a character. | "LATIN CAPITAL LETTER A" names a character. | |||
| octet: an ordered sequence of eight bits considered as a unit | octet: An ordered sequence of eight bits considered as a unit | |||
| character repertoire: A set of characters (in the mathematical | character repertoire: A set of characters (in the mathematical | |||
| sense) | sense) | |||
| sequence of characters: A sequence (one after another) of | sequence of characters: A sequence (one after another) of | |||
| characters | characters | |||
| sequence of octets: A sequence (one after another) of octets | sequence of octets: A sequence (one after another) of octets | |||
| (character) encoding: A method of representing a sequence of | (character) encoding: A method of representing a sequence of | |||
| characters as a sequence of octets (maybe with variants). A | characters as a sequence of octets (maybe with variants). A | |||
| method of (unambiguously) converting a sequence of octets into | method of (unambiguously) converting a sequence of octets into | |||
| a sequence of characters. | a sequence of characters. | |||
| code point: A placeholder for a character in a character encoding, | code point: A placeholder for a character in a character encoding, | |||
| for example to encode additional characters in future versions | for example to encode additional characters in future versions | |||
| of the character encoding. | of the character encoding. | |||
| charset: The name of a parameter or attribute used to identify a | charset: The name of a parameter or attribute used to identify a | |||
| character encoding. | character encoding. | |||
| UCS: Universal Character Set; the coded character set defined by | UCS: Universal Character Set; the coded character set defined by | |||
| [ISO10646] and [UNIV3]. | [ISO10646] and [UNIV4]. | |||
| IRI reference: The term "IRI reference" denotes the common usage | IRI reference: The term "IRI reference" denotes the common usage | |||
| of an internationalized resource identifier. An IRI reference | of an internationalized resource identifier. An IRI reference | |||
| may be absolute or relative, and may have additional | may be absolute or relative. However, the "IRI" that results | |||
| information attached in the form of a fragment identifier. | from such a reference only includes absolute IRIs; any relative | |||
| However, the "IRI" that results from such a reference only | IRIs are resolved to their absolute form. Note that in | |||
| includes the absolute IRI after the fragment identifier (if | [RFC2396], URIs did not include fragment identifiers, but in | |||
| any) is removed and after any relative IRI is resolved to its | [RFCYYYY], fragment identifiers are part of URIs. | |||
| absolute form. | ||||
| 1.4 Notation | 1.4 Notation | |||
| RFCs and Internet Drafts currently do not allow any characters | RFCs and Internet Drafts currently do not allow any characters | |||
| outside the US-ASCII repertoire. Therefore, this document uses | outside the US-ASCII repertoire. Therefore, this document uses | |||
| various special notations to denote such characters. | various special notations to denote such characters in examples. | |||
| In text, characters outside US-ASCII are sometimes referenced by | In text, characters outside US-ASCII are sometimes referenced by | |||
| using a prefix of 'U+', followed by four to six hexadecimal digits. | using a prefix of 'U+', followed by four to six hexadecimal digits. | |||
| To represent characters outside US-ASCII in examples, this document | To represent characters outside US-ASCII in examples, this document | |||
| uses two notations called 'XML Notation' and 'Bidi Notation'. | uses two notations called 'XML Notation' and 'Bidi Notation'. | |||
| XML Notation uses leading '&#x', trailing ';', and the hexadecimal | XML Notation uses leading '&#x', trailing ';', and the hexadecimal | |||
| number of the character in the UCS in between. Example: &#x44F; | number of the character in the UCS in between. Example: &#x44F; | |||
| stands for CYRILLIC SMALL LETTER YA. In this notation, an actual | stands for CYRILLIC SMALL LETTER YA. In this notation, an actual | |||
| '&' is denoted by '&amp;'. | '&' is denoted by '&amp;'. | |||
| Bidi Notation is used for bidirectional examples: lower case ASCII | Bidi Notation is used for bidirectional examples: lower case ASCII | |||
| letters stand for Latin letters or other letters that are written | letters stand for Latin letters or other letters that are written | |||
| left-to-right, whereas upper case letters represent Arabic or Hebrew | left-to-right, whereas upper case letters represent Arabic or Hebrew | |||
| letters that are written right-to-left. | letters that are written right-to-left. | |||
| The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", | ||||
| "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this | ||||
| document are to be interpreted as described in [RFC2119]. | ||||
| 2. IRI Syntax | 2. IRI Syntax | |||
| This section defines the syntax of Internationalized Resource | This section defines the syntax of Internationalized Resource | |||
| Identifiers (IRIs). | Identifiers (IRIs). | |||
| As with URIs, an IRI is defined as a sequence of characters, not as a | As with URIs, an IRI is defined as a sequence of characters, not as a | |||
| sequence of octets. This definition accommodates the fact that IRIs | sequence of octets. This definition accommodates the fact that IRIs | |||
| may be written on paper or read over the radio as well as being | may be written on paper or read over the radio as well as being | |||
| transmitted over the network. The same IRI may be represented as | stored or transmitted digitally. The same IRI may be represented as | |||
| different sequences of octets in different protocols or documents if | different sequences of octets in different protocols or documents if | |||
| these protocols or documents use different character encodings (and/ | these protocols or documents use different character encodings (and/ | |||
| or transfer encodings). Using the same character encoding as the | or transfer encodings). Using the same character encoding as the | |||
| containing protocol or document assures that the characters in the | containing protocol or document assures that the characters in the | |||
| IRI can be handled (searched, converted, displayed,...) in the same | IRI can be handled (searched, converted, displayed,...) in the same | |||
| way as the rest of the protocol or document. | way as the rest of the protocol or document. | |||
| 2.1 Summary of IRI Syntax | 2.1 Summary of IRI Syntax | |||
| IRIs are defined similarly to URIs in [RFC2396] (as modified by | IRIs are defined similarly to URIs in [RFCYYYY], but the class of | |||
| [RFC2732] and [IDNURI]), but the class of unreserved characters is | unreserved characters is extended by adding the characters of the UCS | |||
| extended by adding the characters of the UCS (Universal Character | (Universal Character Set, [ISO10646]) beyond U+0080, subject to the | |||
| Set, [ISO10646]) beyond U+0080, subject to the limitations given in | limitations given in the syntax rules below and in Section 6.1. | |||
| the syntax rules below and in Section 5.1. | ||||
| Otherwise, the syntax and use of components and reserved characters | Otherwise, the syntax and use of components and reserved characters | |||
| is the same as that in [RFC2396]. All the operations defined in | is the same as that in [RFCYYYY]. All the operations defined in | |||
| [RFC2396], such as the resolution of relative URIs, can be applied to | [RFCYYYY], such as the resolution of relative URIs, can be applied to | |||
| IRIs by IRI-processing software in exactly the same way as this is | IRIs by IRI-processing software in exactly the same way as this is | |||
| done to URIs by URI-processing software. | done to URIs by URI-processing software. | |||
| Note: [RFC2396]: Uniform Resource Identifiers (URI): Generic Syntax" | Characters outside the US-ASCII range are not reserved and therefore | |||
| is being revised as [RFC2396bis]. The syntax used in this document | MUST NOT be used for syntactical purposes such as to delimit | |||
| includes bug fixes from [RFC2396bis]. | components in newly defined schemes. As an example, it is not | |||
| allowed to use U+00A2, CENT SIGN, as a delimiter in IRIs, because it | ||||
| Characters outside the US-ASCII range MUST NOT be used for | is in the 'iunreserved' category, in the same way as it is not | |||
| syntactical purposes such as to delimit components in newly defined | possible to use '-' as a delimiter, because it is in the 'unreserved' | |||
| schemes. As an example, it is not allowed to use U+00A2, CENT SIGN, | category in URIs. | |||
| as a delimiter in IRIs, because it is in the 'iunreserved' category, | ||||
| in the same way as it is not possible to use '-' as a delimiter, | ||||
| because it is in the 'unreserved' category in URIs. | ||||
| 2.2 ABNF for IRI References and IRIs | 2.2 ABNF for IRI References and IRIs | |||
| While it might be possible to define IRI references and IRIs merely | While it might be possible to define IRI references and IRIs merely | |||
| by their transformation to URI references and URIs, they can also be | by their transformation to URI references and URIs, they can also be | |||
| accepted and processed directly. Therefore, an ABNF definition for | accepted and processed directly. Therefore, an ABNF definition for | |||
| IRI references (which are the most general concept and the start of | IRI references (which are the most general concept and the start of | |||
| the grammar) and IRIs is given here. The syntax of this ABNF is | the grammar) and IRIs is given here. The syntax of this ABNF is | |||
| described in [RFC2234]. Character numbers are taken from the UCS, | described in [RFC2234]. Character numbers are taken from the UCS, | |||
| without implying any actual binary encoding. Terminals in the ABNF | without implying any actual binary encoding. Terminals in the ABNF | |||
| are characters, not bytes. | are characters, not bytes. | |||
| The following rules are different from [RFC2396]: | The following rules are different from [RFCYYYY]: | |||
| absolute-IRI-reference = absolute-IRI [ "#" ifragment ] | IRI-reference = IRI / relative-IRI | |||
| IRI-reference = [ absolute-IRI / relative-IRI ] | IRI = scheme ":" ihier-part [ "?" iquery ] [ "#" ifragment ] | |||
| [ "#" ifragment ] | ||||
| absolute-IRI = scheme ":" ( ihier-part / iopaque-part ) | ||||
| relative-IRI = [ inet-path / iabs-path / irel-path ] | ||||
| [ "?" iquery ] | ||||
| ihier-part = [ inet-path / iabs-path ] [ "?" iquery ] | absolute-IRI = scheme ":" ihier-part [ "?" iquery ] | |||
| iopaque-part = iric-no-slash *iric | ||||
| iric-no-slash = iunreserved / escaped / "[" / "]" / ";" / "?" / | relative-IRI = ihier-part [ "?" iquery ] [ "#" ifragment ] | |||
| ":" / "@" / "&" / "=" / "+" / "$" / "," | ihier-part = inet-path / iabs-path / irel-path | |||
| inet-path = "//" iauthority [ iabs-path ] | inet-path = "//" iauthority [ iabs-path ] | |||
| iabs-path = "/" ipath-segments | ||||
| irel-path = irel-segment [ iabs-path ] | ||||
| irel-segment = 1*( iunreserved / escaped / ";" / | iabs-path = "/" ipath-segments | |||
| "@" / "&" / "=" / "+" / "$" / "," ) | ||||
| iauthority = iserver / ireg-name | irel-path = ipath-segments | |||
| ireg-name = 1*( iunreserved / escaped / ";" / | iauthority = [ iuserinfo "@" ] ihost [ ":" port ] | |||
| ":" / "@" / "&" / "=" / "+" / "$" / "," ) | ||||
| iserver = [ [ iuserinfo "@" ] ihostport ] | ||||
| iuserinfo = *( iunreserved / escaped / ";" / | iuserinfo = *( iunreserved / escaped / ";" / | |||
| ":" / "&" / "=" / "+" / "$" / "," ) | ":" / "&" / "=" / "+" / "$" / "," ) | |||
| ihostport = ihost [ ":" port ] | ihost = [ IPv6reference / IPv4address / ihostname ] | |||
| ihost = IPv6reference / IPv4address / ihostname | ||||
| ihostname = idomainlabel iqualified | ||||
| iqualified = *( "." idomainlabel ) [ "." ] | ||||
| ihostname = idomainlabel [ iqualified] | ||||
| iqualified = *( "." idomainlabel ) [ "." itoplabel [ "." ] ] | ||||
| idomainlabel = <<See following production rules>> | idomainlabel = <<See following production rules>> | |||
| itoplabel = <<See following production rules>> | ||||
| ipath = [ iabs-path / iopaque-part ] | ||||
| ipath-segments = isegment *( "/" isegment ) | ipath-segments = isegment *( "/" isegment ) | |||
| isegment = *ipchar | isegment = *ipchar | |||
| ipchar = iunreserved / escaped / ";" / | ipchar = iunreserved / escaped / ";" / | |||
| ":" / "@" / "&" / "=" / "+" / "$" / "," | ":" / "@" / "&" / "=" / "+" / "$" / "," | |||
| iquery = *( ipchar / iprivate / "/" / "?" ) | iquery = *( ipchar / iprivate / "/" / "?" ) | |||
| ifragment = *( ipchar / "/" / "?" ) | ifragment = *( ipchar / "/" / "?" ) | |||
| iric = reserved / iunreserved / escaped | iric = reserved / iunreserved / escaped | |||
| iunreserved = unreserved / ucschar / iadditional | ||||
| iadditional = "<" / ">" / DQUOTE / SP / "{" / "}" / | iunreserved = unreserved / ucschar | |||
| "|" / "\" / "^" / "`" | ||||
| ucschar = %xA0-D7FF / %xF900-FDCF / %xFDF0-FFEF | ucschar = %xA0-D7FF / %xF900-FDCF / %xFDF0-FFEF | |||
| / %x10000-1FFFD / %x20000-2FFFD / %x30000-3FFFD | / %x10000-1FFFD / %x20000-2FFFD / %x30000-3FFFD | |||
| / %x40000-4FFFD / %x50000-5FFFD / %x60000-6FFFD | / %x40000-4FFFD / %x50000-5FFFD / %x60000-6FFFD | |||
| / %x70000-7FFFD / %x80000-8FFFD / %x90000-9FFFD | / %x70000-7FFFD / %x80000-8FFFD / %x90000-9FFFD | |||
| / %xA0000-AFFFD / %xB0000-BFFFD / %xC0000-CFFFD | / %xA0000-AFFFD / %xB0000-BFFFD / %xC0000-CFFFD | |||
| / %xD0000-DFFFD / %xE1000-EFFFD | / %xD0000-DFFFD / %xE1000-EFFFD | |||
| iprivate = %xE000-F8FF / %xF0000-FFFFD / %x100000-10FFFD | iprivate = %xE000-F8FF / %xF0000-FFFFD / %x100000-10FFFD | |||
| The 'idomainlabel' and 'itoplabel' production rules are as follows: | The 'idomainlabel' production rule is as follows: | |||
| The values 'idomainlabel' and 'itoplabel' are defined as a string of | The value 'idomainlabel' is defined as a string of 'ucschar' obeying | |||
| 'ucschar' obeying the following rules: | the following rules: | |||
| a) Given a string of 'ucschar' values, the ToASCII operation | a) Given a string of 'ucschar' values, the ToASCII operation | |||
| [RFCXXXX] is performed on that string with the flag | [RFC3490] is performed on that string with the flag | |||
| UseSTD3ASCIIRules set to TRUE and the flag AllowUnassigned set | UseSTD3ASCIIRules set to TRUE and the flag AllowUnassigned set | |||
| to FALSE for creating IRIs and set to TRUE otherwise. | to FALSE for creating IRIs and set to TRUE otherwise. | |||
| b) ToASCII is successful and results in a string conforming to | b) ToASCII is successful and results in a string conforming to | |||
| 'domainlabel' for 'idomainlabel' and 'toplabel' for 'itoplabel' | 'domainlabel' (see below). | |||
| (see below for 'domainlabel' and 'toplabel'). | ||||
| Note that the space character and various delimiters are allowed in | ||||
| IRIs and IRI references. This is further discussed in Section 5.1. | ||||
| The following are the same as [RFC2396bis]: | The following are the same as [RFCYYYY]: | |||
| scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." ) | scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." ) | |||
| port = *DIGIT | port = *DIGIT | |||
| domainlabel = alphanum [ 0*61( alphanum / "-" ) alphanum ] | | |||
| toplabel = ALPHA [ 0*61( alphanum / "-" ) alphanum ] | | |||
| alphanum = ALPHA / DIGIT | | |||
| IPv4address = dec-octet 3( "." dec-octet ) | domainlabel = alphanum [ 0*61( alphanum / "-" ) alphanum ] | |||
| dec-octet = DIGIT / ; 0-9 | ||||
| ( %x31-39 DIGIT ) / ; 10-99 | ||||
| ( "1" 2*DIGIT ) / ; 100-199 | ||||
| ( "2" %x30-34 DIGIT ) / ; 200-249 | ||||
| ( "25" %x30-35 ) ; 250-255 | ||||
| IPv6reference = "[" IPv6address "]" | ||||
| IPv6address = ( 7( h4 ":" ) h4 ) / | ||||
| ( "::" 0*6( h4 ":" ) [ h4 ] ) / | ||||
| ( h4 "::" 0*5( h4 ":" ) [ h4 ] ) / | ||||
| ( h4 ":" h4 "::" 0*4( h4 ":" ) [ h4 ] ) / | ||||
| ( h4 2( ":" h4 ) "::" 0*3( h4 ":" ) [ h4 ] ) / | ||||
| ( h4 3( ":" h4 ) "::" 0*2( h4 ":" ) [ h4 ] ) / | ||||
| ( h4 4( ":" h4 ) "::" 0*1( h4 ":" ) [ h4 ] ) / | ||||
| ( 6( h4 ":" ) IPv4address )/ | ||||
| ( "::" 0*5( h4 ":" ) IPv4address )/ | ||||
| ( h4 "::" 0*4( h4 ":" ) IPv4address )/ | ||||
| ( h4 ":" h4 "::" 0*3( h4 ":" ) IPv4address )/ | ||||
| ( h4 2( ":" h4 ) "::" 0*2( h4 ":" ) IPv4address )/ | ||||
| ( h4 3( ":" h4 ) "::" 0*1( h4 ":" ) IPv4address ) | ||||
| h4 = 1*4HEXDIG | alphanum = ALPHA / DIGIT | |||
| reserved = "[" / "]" / ";" / "/" / "?" / | ||||
| ":" / "@" / "&" / "=" / "+" / "$" / "," | ||||
| unreserved = ALPHA / DIGIT / mark | ||||
| mark = "-" / "_" / "." / "!" / "~" / "*" / "'" / | ||||
| "(" / ")" | ||||
| escaped = "%" HEXDIG HEXDIG | IPv4address = dec-octet "." dec-octet "." dec-octet "." dec-octet | |||
| 2.3 IRI Equivalence and Normalization | dec-octet = DIGIT ; 0-9 | |||
| / ( %x31-39 DIGIT ) ; 10-99 | ||||
| / ( "1" 2DIGIT ) ; 100-199 | ||||
| / ( "2" %x30-34 DIGIT ) ; 200-249 | ||||
| / ( "25" %x30-35 ) ; 250-255 | ||||
| There is no general rule or procedure to decide whether two arbitrary | IPv6reference = "[" IPv6address "]" | |||
| IRIs are equivalent or not (i.e. refer to the same resource or not). | ||||
| Two IRIs that look almost the same may refer to different resources. | ||||
| Two IRIs that look completely different may refer to, and resolve to, | ||||
| the same resource. | ||||
| In some scenarios a definite answer to the question of IRI | IPv6address = 6( h4 ":" ) ls32 | |||
| equivalence is needed that is independent of the scheme used and | / "::" 5( h4 ":" ) ls32 | |||
| always can be calculated quickly and without accessing a network. An | / [ h4 ] "::" 4( h4 ":" ) ls32 | |||
| example of such a case might be XML Namespaces ([XMLNamespace]). In | / [ *1( h4 ":" ) h4 ] "::" 3( h4 ":" ) ls32 | |||
| such cases, two IRIs SHOULD be defined as equivalent if and only if | / [ *2( h4 ":" ) h4 ] "::" 2( h4 ":" ) ls32 | |||
| they are character-by-character equivalent. This is the same as | / [ *3( h4 ":" ) h4 ] "::" h4 ":" ls32 | |||
| being byte-by-byte equivalent if the character encoding for both IRIs | / [ *4( h4 ":" ) h4 ] "::" ls32 | |||
| is the same. As an example, | / [ *5( h4 ":" ) h4 ] "::" h4 | |||
| http://example.org/~user, http://example.org/%7euser, and | / [ *6( h4 ":" ) h4 ] "::" | |||
| http://example.org/%7Euser would not be equivalent under this | ||||
| definition. In such a case, the comparison function MUST NOT map the | ||||
| IRIs to URIs, because such a mapping would create something different | ||||
| under this equivalence relationship. | ||||
| It follows from the above that IRIs SHOULD NOT be modified when being | h4 = 1*4HEXDIG | |||
| transported. | ||||
| For actual resolution, differences in escaping (except for the | ls32 = ( h4 ":" h4 ) / IPv4address | |||
| escaping of reserved characters) MUST always result in the same | ||||
| resource. For example, http://example.org/~user, | ||||
| http://example.org/%7euser and http://example.org/%7Euser must | ||||
| resolve to the same resource. If this kind of equivalence is to be | ||||
| tested, the escaping of both IRIs to be compared has to be aligned, | ||||
| for example by converting both IRIs to URIs (see Section 3.1) and | ||||
| making sure that the case of the hexadecimal characters in the %- | ||||
| escape is always the same. Such conversions MUST only be done on the | ||||
| fly, without changing the original IRI. | ||||
| Specific schemes and resolution mechanisms may define additional | reserved = "/" / "?" / "#" / "[" / "]" / ";" / | |||
| equivalences. For a specific scheme, two IRIs that e.g. differ only | ":" / "@" / "&" / "=" / "+" / "$" / "," | |||
| by case may be equivalent. However, this document does not deal with | ||||
| scheme-specific issues. | ||||
| The Unicode Standard [UNIV3] defines various equivalences between | unreserved = ALPHA / DIGIT / mark | |||
| sequences of characters for various purposes. Unicode Standard Annex | ||||
| #15 [UTR15] defines various Normalization Forms for these | ||||
| equivalences. IRIs SHOULD be created using Normalization Form C | ||||
| (NFC). Equivalence of IRIs MUST rely on the assumption that IRIs are | | |||
| appropriately pre-normalized, rather than applying normalization when | | |||
| comparing two IRIs, except when converting from a non-UCS-based | | |||
| encoding to a UCS-based encoding, where a normalizing transcoder | | |||
| using NFC MUST be used for interoperability. | ||||
| As an example, http://www.example.org/r&#xE9;sum&#xE9;.html (in XML | mark = "-" / "_" / "." / "!" / "~" / "*" / "'" / | |||
| Notation) is in NFC. On the other hand, http://www.example.org/ | "(" / ")" | |||
| re&#x301;sume&#x301;.html is not in NFC. The former uses precombined | | |||
| e-acute characters, the latter uses 'e' characters followed by | | |||
| combining acute accents; both are defined as canonically equivalent | | |||
| in [UNIV3]. | ||||
| Various IRI schemes may allow the usage of Internationalized Domain Names | escaped = "%" HEXDIG HEXDIG | |||
| (IDN) [RFCXXXX]. When in use in IRIs, those names SHOULD be | ||||
| validated using the ToASCII operation defined in [RFCXXXX], with the | ||||
| flags "UseSTD3ASCIIRules" and "AllowUnassigned". An IRI containing | ||||
| an invalid IDN cannot successfully be resolved. For legibility | ||||
| purposes, IDN components of IRIs SHOULD NOT be converted into ASCII | | |||
| Compatible Encoding (ACE). However, this conversion may be applied | | |||
| when mapping an IRI into a URI (see Section 3.1). | | |||
| 3. Relationship between IRIs and URIs | 3. Relationship between IRIs and URIs | |||
| IRIs are meant to replace URIs in identifying resources for | IRIs are meant to replace URIs in identifying resources for | |||
| protocols, formats and software components which use a UCS-based | protocols, formats and software components which use a UCS-based | |||
| character repertoire. These protocols and components may never need | character repertoire. These protocols and components may never need | |||
| to use URIs directly, especially when the resource identifier is used | to use URIs directly, especially when the resource identifier is used | |||
| simply for identification purposes. However, when the resource | simply for identification purposes. However, when the resource | |||
| identifier is used for resource retrieval, it is in many cases | identifier is used for resource retrieval, it is in many cases | |||
| necessary to determine the associated URI because most retrieval | necessary to determine the associated URI because most retrieval | |||
| skipping to change at page 12, line 28 | skipping to change at page 10, line 36 | |||
| This mapping has two purposes: | This mapping has two purposes: | |||
| a) Syntactical: Many URI schemes and components define additional | a) Syntactical: Many URI schemes and components define additional | |||
| syntactical restrictions not captured in Section 2.2. Such | syntactical restrictions not captured in Section 2.2. Such | |||
| restrictions can be applied to IRIs by noting that IRIs are | restrictions can be applied to IRIs by noting that IRIs are | |||
| only valid if they map to syntactically valid URIs. This means | only valid if they map to syntactically valid URIs. This means | |||
| that such syntactical restrictions do not have to be defined | that such syntactical restrictions do not have to be defined | |||
| again on the IRI level. | again on the IRI level. | |||
| b) Interpretational: URIs identify resources in various ways. | b) Interpretational: URIs identify resources in various ways. | |||
| IRIs also identify resources. When the IRI is used simply for | IRIs also identify resources. When the IRI is used solely for | |||
| identification purposes, it is not necessary to map the IRI to | identification purposes, it is not necessary to map the IRI to | |||
| a URI (see Section 2.3). However, when an IRI is used for | a URI (see Section 5). However, when an IRI is used for | |||
| resource retrieval, the resource that the IRI locates is the | resource retrieval, the resource that the IRI locates is the | |||
| same as the one located by the URI obtained after converting | same as the one located by the URI obtained after converting | |||
| the IRI according to the procedure defined here. This means | the IRI according to the procedure defined here. This means | |||
| that there is no need to define resolution separately on the | that there is no need to define resolution separately on the | |||
| IRI level. | IRI level. | |||
| Applications MUST map IRIs to URIs using the following two steps. | Applications MUST map IRIs to URIs using the following three steps. | |||
| Step 1) This step generates a UCS-based encoding from the original | Step 1) This step generates a UCS-based encoding from the original | |||
| IRI format. This step has three variants, depending on the | IRI format. This step has three variants, depending on the | |||
| form of the input. | form of the input. | |||
| Variant A) If the IRI is written on paper or read out loud, | Variant A) If the IRI is written on paper or read out loud, | |||
| or otherwise represented as a sequence of characters | or otherwise represented as a sequence of characters | |||
| independent of any encoding: Represent the IRI as a | independent of any encoding: Represent the IRI as a | |||
| sequence of characters from the UCS normalized according | sequence of characters from the UCS normalized according | |||
| to Normalization Form C (NFC, [UTR15]). | to Normalization Form C (NFC, [UTR15]). | |||
| Variant B) If the IRI is in some digital representation | Variant B) If the IRI is in some digital representation | |||
| (e.g. an octet stream) in some non-Unicode encoding: | (e.g. an octet stream) in some known non-Unicode | |||
| Convert the IRI to a sequence of characters from the UCS | encoding: Convert the IRI to a sequence of characters | |||
| normalized according to NFC. | from the UCS normalized according to NFC. | |||
| Variant C) If the IRI is in a Unicode-based encoding (for | Variant C) If the IRI is in a Unicode-based encoding (for | |||
| example UTF-8 or UTF-16): Do not normalize. Move | example UTF-8 or UTF-16): Do not normalize. Move | |||
| directly to Step 2. | directly to Step 2. | |||
| Step 2) For each character that is disallowed in URI references, | Step 2) If the IRI contains an 'ihostname' part, replace this | |||
| 'ihostname' part by the part converted using the ToASCII | ||||
| operation specified in Section 4.1 of [RFC3490], with the flag | ||||
| UseSTD3ASCIIRules set to TRUE and the flag AllowUnassigned set | ||||
| to FALSE for creating IRIs and set to TRUE otherwise. | ||||
| Step 3) For each character that is disallowed in URI references, | ||||
| apply steps 1) through 3) below. The disallowed characters | apply steps 1) through 3) below. The disallowed characters | |||
| consist of all non-ASCII characters, plus the excluded | consist of all non-ASCII characters allowed in IRIs. | |||
| characters listed in Section 2.4 of [RFC2396], except for the | ||||
| number sign (#) and percent sign (%) and the square bracket | ||||
| characters re-allowed in [RFC2732]. | ||||
| 1) Convert the character to a sequence of one or more octets | 1) Convert the character to a sequence of one or more octets | |||
| using UTF-8 [RFC2279]. | using UTF-8 [RFCXXXX]. | |||
| 2) Convert each octet to %HH, where HH is the hexadecimal | 2) Convert each octet to %HH, where HH is the hexadecimal | |||
| notation of the octet value. Note: This is identical to | notation of the octet value. Note: This is identical to | |||
| the escaping mechanism in Section 2.4.1 of [RFC2396]. | the escaping mechanism in Section 2.4.1 of [RFCYYYY]. | |||
| Note: To reduce variability, the hexadecimal notation | Note: To reduce variability, the hexadecimal notation | |||
| SHOULD use upper case letters. | SHOULD use upper case letters. | |||
| 3) Replace the original character by the resulting character | 3) Replace the original character by the resulting character | |||
| sequence (i.e. a sequence of %HH triplets). | sequence (i.e. a sequence of %HH triplets). | |||
| Note that in this process (in step 2.3), characters allowed in URI | Note that the ToASCII operation in Step 2) may fail, but only if the | |||
| references and existing escape sequences are not escaped further. | IRI does not conform to the rules in Section 2.2. | |||
| (This mapping is similar to, but different from, the escaping applied | ||||
| when including arbitrary content into some part of a URI.) For | Note: For backwards compatibility with implementations of previous | |||
| example, an IRI of | drafts of this specification, infrastructure accepting IRIs MAY also | |||
| http://www.example.org/red%09ros&#xE9;#&lt;red&gt; (in XML notation) is | deal with 'ihostname' parts escaped according to Step 3) rather than | |||
| Step 2). For example, Step 2) converts the IRI | ||||
| http://résumé.example.org to | ||||
| http://xn--rsum-bpad.example.org. For backwards compatibility, | ||||
| http://r%C3%A9sum%C3%A9.example.org would also be converted to | ||||
| http://xn--rsum-bpad.example.org. | ||||
| Note that Internationalized Domain Names may be contained in parts of | ||||
| an IRI other than the 'ihostname' part. | ||||
| Note that in this process (in step 3.3), characters allowed in URI | ||||
| references as well as existing escape sequences are not escaped | ||||
| further. (This mapping is similar to, but different from, the | ||||
| escaping applied when including arbitrary content into some part of a | ||||
| URI.) For example, an IRI of | ||||
| | http://www.example.org/red%09ros&#xE9;#red (in XML notation) is | |||
| converted to | converted to | |||
| http://www.example.org/red%09ros%C3%A9#%3Cred%3E, not to something | http://www.example.org/red%09ros%C3%A9#red, not to something like | |||
| like | ||||
| http%3A%2F%2Fwww.example.org%2Fred%2509ros%C3%A9%23red. | http%3A%2F%2Fwww.example.org%2Fred%2509ros%C3%A9%23red. | |||
| Note that some older software transcoding to UTF-8 may produce | Note that some older software transcoding to UTF-8 may produce | |||
| illegal output for some input, in particular for characters outside | illegal output for some input, in particular for characters outside | |||
| the BMP (Basic Multilingual Plane). As an example, for the following | the BMP (Basic Multilingual Plane). As an example, for the following | |||
| IRI with non-BMP characters (in XML Notation): | IRI with non-BMP characters (in XML Notation): | |||
| http://example.com/&#x10300;&#x10301;&#x10302; | http://example.com/&#x10300;&#x10301;&#x10302; | |||
| (the first three letters of the Old Italic alphabet) the correct | (the first three letters of the Old Italic alphabet) the correct | |||
| conversion to a URI is: | conversion to a URI is: | |||
| http://example.com/%F0%90%8C%80%F0%90%8C%81%F0%90%8C%82 | http://example.com/%F0%90%8C%80%F0%90%8C%81%F0%90%8C%82 | |||
| The above mapping produces a URI fully conforming to [RFC2396] (as | The above mapping produces a URI fully conforming to [RFCYYYY] out of | |||
| amended by [RFC2732] and [IDNURI]) out of each IRI. The mapping is | each IRI. The mapping is also an identity transformation for URIs | |||
| also an identity transformation for URIs and is idempotent -- | and is idempotent -- applying the mapping a second time will not | |||
| applying the mapping a second time will not change anything. Every | change anything. Every URI is therefore by definition an IRI. | |||
| URI is therefore by definition an IRI. | ||||
| Note: For backwards compatibility with infrastructure that does not | Note: Earlier drafts of this specification allowed the space | |||
| implement the updates of [IDNURI], converters MAY also convert the | character and various delimiters in IRIs and IRI references. The | |||
| 'ihostname' part of an IRI using the ToASCII operation specified in | full list of these characters was: "<", ">", '"', Space, "{", "}", | |||
| Section 4.1 of [RFCXXXX] between Step 1 and Step 2. Note that the | "|", "\", "^", and "`", i.e. all printable characters in US-ASCII | |||
| ToASCII operation may fail. Note that Internationalized Domain Names | that are not allowed in URIs. For backwards compatibility, | |||
| may be contained in parts of an IRI other than the 'ihostname' part. | implementations MAY also include these characters in step 3) above. | |||
| If such characters are found but are not converted, then the | ||||
| conversion SHOULD fail. Please note that the number sign ("#"), the | ||||
| percent sign ("%"), and the square bracket characters ("[", "]") are | ||||
| | not part of the above list, and MUST NOT be converted. Protocols and | |||
| formats that have used earlier definitions of IRIs including these | ||||
| characters MAY require unescaping of these characters as a | ||||
| preprocessing step to extract the actual IRI from a given field. | ||||
| Such preprocessing MAY also be used by applications allowing the user | ||||
| to enter an IRI. | ||||
| 3.2 Converting URIs to IRIs | 3.2 Converting URIs to IRIs | |||
| In some situations, it may be desirable to try to convert a URI into | In some situations, it may be desirable to try to convert a URI into | |||
| an equivalent IRI. This section gives a procedure to do such a | an equivalent IRI. This section gives a procedure to do such a | |||
| conversion. The conversion described in this section will always | conversion. The conversion described in this section will always | |||
| result in an IRI which maps back to the URI that was used as an input | result in an IRI which maps back to the URI that was used as an input | |||
| for the conversion (except for potential case differences in escape | for the conversion (except for potential case differences in escape | |||
| sequences). However, the IRI resulting from this conversion may not | sequences). However, the IRI resulting from this conversion may not | |||
| be exactly the same as the original IRI (if there ever was one). | be exactly the same as the original IRI (if there ever was one). | |||
| URI to IRI conversion removes escape sequences, but not all escaping | URI to IRI conversion removes escape sequences, but not all escaping | |||
| can be eliminated. There are several reasons for this: | can be eliminated. There are several reasons for this: | |||
| a) Some escape sequences are necessary to distinguish escaped and | a) Some escape sequences are necessary to distinguish escaped and | |||
| unescaped uses of reserved characters. | unescaped uses of reserved characters. | |||
| b) Some escape sequences cannot be interpreted as sequences of | b) Some escape sequences cannot be interpreted as sequences of | |||
| UTF-8 octets. | UTF-8 octets. | |||
| (Note: Due to the regularities in the octet patterns of UTF-8, | (Note: The octet patterns of UTF-8 are highly regular. | |||
| there is a very high probability, but no guarantee, that escape | Therefore, there is a very high probability, but no guarantee, | |||
| sequences that can be interpreted as sequences of UTF-8 octets | that escape sequences that can be interpreted as sequences of | |||
| actually originated from UTF-8. For a detailed discussion, see | UTF-8 octets actually originated from UTF-8. For a detailed | |||
| [Duerst97].) | discussion, see [Duerst97].) | |||
| c) The conversion may result in a character that is not | c) The conversion may result in a character that is not | |||
| appropriate in an IRI. See Section 5.1 for further details. | appropriate in an IRI. See Section 6.1 for further details. | |||
| Conversion from a URI to an IRI is done using the following steps (or | Conversion from a URI to an IRI is done using the following steps (or | |||
| any other algorithm that produces the same result): | any other algorithm that produces the same result): | |||
| 1) Represent the URI as a sequence of octets in US-ASCII. | 1) Represent the URI as a sequence of octets in US-ASCII. | |||
| 2) Convert all hexadecimal escapes (% followed by two hexadecimal | 2) Replace any punycode-encoded domainlabel in the URI by the | |||
| digits) except those corresponding to '#' and '%' and | result of the ToUnicode function represented as UTF-8. | |||
| characters in 'reserved', to the corresponding octets. | ||||
| 3) Re-escape any octet produced in step 2) that is not part of a | 3) Convert all hexadecimal escapes (% followed by two hexadecimal | |||
| digits) except those corresponding to '%', characters in | ||||
| 'reserved', and characters in US-ASCII not allowed in URIs, to | ||||
| the corresponding octets. | ||||
| 4) Re-escape any octet produced in step 3) that is not part of a | ||||
| strictly legal UTF-8 octet sequence. | strictly legal UTF-8 octet sequence. | |||
| 4) Re-escape all octets produced in step 2) that in UTF-8 | 5) Re-escape all octets produced in step 3) that in UTF-8 | |||
| represent characters that are not appropriate according to | represent characters that are not appropriate according to | |||
| Section 4.1 and Section 5.1. | Section 4.1 and Section 6.1. | |||
| 5) Interpret the resulting octet sequence as a sequence of | 6) Interpret the resulting octet sequence as a sequence of | |||
| characters encoded in UTF-8. | characters encoded in UTF-8. | |||
| This procedure will convert as many escaped non-ASCII characters as | This procedure will convert as many escaped non-ASCII characters as | |||
| possible to characters in an IRI. Because there are some choices | possible to characters in an IRI. Because there are some choices | |||
| when applying step 4) (see Section 5.1), results may differ. | when applying step 5) (see Section 6.1), results may vary. | |||
| Conversions from URIs to IRIs MUST NOT use any other encoding than | Conversions from URIs to IRIs MUST NOT use any other encoding than | |||
| UTF-8 in steps 3) and 4) above, even if it might be possible from | UTF-8 in steps 2), 4) and 5) above, even if it might be possible from | |||
| context to guess that another encoding than UTF-8 was used in the | context to guess that another encoding than UTF-8 was used in the | |||
| URI. As an example, the URI http://www.example.org/r%E9sum%E9.html, | URI. As an example, the URI http://www.example.org/r%E9sum%E9.html | |||
| which with some guesses might be interpreted to contain two e-acute | might with some guessing be interpreted to contain two e-acute | |||
| characters encoded as iso-8859-1, must not be converted to an IRI | characters encoded as iso-8859-1. It must not be converted to an IRI | |||
| containing these e-acute characters. Otherwise, the IRI will in the | containing these e-acute characters. Otherwise, the IRI will in the | |||
| future be mapped to http://www.example.org/r%C3%A9sum%C3%A9.html, | future be mapped to http://www.example.org/r%C3%A9sum%C3%A9.html, | |||
| which is a different URI from http://www.example.org/r%E9sum%E9.html. | which is a different URI from http://www.example.org/r%E9sum%E9.html. | |||
| 3.2.1 Examples | 3.2.1 Examples | |||
| This section shows various examples of converting URIs to IRIs. The | This section shows various examples of converting URIs to IRIs. The | |||
| notation <hh> is used to denote octets outside those that can be | notation <hh> is used to denote octets outside those that can be | |||
| represented in this document. Each example shows the result after | represented in this document. Each example shows the result after | |||
| applying each of the steps 1) to 5). XML Notation is used for the | applying each of the steps 1) to 6). XML Notation is used for the | |||
| final result. | final result. | |||
| The following example contains the sequence '%C3%BC', which is a | The following example contains the sequence '%C3%BC', which is a | |||
| strictly legal UTF-8 sequence, and which is converted into the actual | strictly legal UTF-8 sequence, and which is converted into the actual | |||
| character U+00FC LATIN SMALL LETTER U WITH DIAERESIS (also known as | character U+00FC LATIN SMALL LETTER U WITH DIAERESIS (also known as | |||
| u-umlaut). | u-umlaut). | |||
| 1) http://www.example.org/D%C3%BCrst | 1) http://www.example.org/D%C3%BCrst | |||
| 2) http://www.example.org/D<c3><bc>rst | 2) http://www.example.org/D%C3%BCrst | |||
| 3) http://www.example.org/D<c3><bc>rst | 3) http://www.example.org/D<c3><bc>rst | |||
| 4) http://www.example.org/D<c3><bc>rst | 4) http://www.example.org/D<c3><bc>rst | |||
| 5) http://www.example.org/D&#xFC;rst | 5) http://www.example.org/D<c3><bc>rst | |||
| | 6) http://www.example.org/D&#xFC;rst | |||
| The following example contains the sequence '%FC', which might | The following example contains the sequence '%FC', which might | |||
| represent U+00FC LATIN SMALL LETTER U WITH DIAERESIS in the iso-8859- | represent U+00FC LATIN SMALL LETTER U WITH DIAERESIS in the | |||
| 1 encoding. (It might represent other characters in other encodings. | iso-8859-1 encoding. (It might represent other characters in other | |||
| For example, the octet <FC> in iso-8859-5 represents U+045C CYRILLIC | encodings. For example, the octet <FC> in iso-8859-5 represents | |||
| SMALL LETTER KJE.) Because <FC> is not part of a strictly legal UTF-8 | U+045C CYRILLIC SMALL LETTER KJE.) Because <FC> is not part of a | |||
| sequence, it is re-escaped in step 3). | strictly legal UTF-8 sequence, it is re-escaped in step 4). | |||
| 1) http://www.example.org/D%FCrst | 1) http://www.example.org/D%FCrst | |||
| 2) http://www.example.org/D<FC>rst | 2) http://www.example.org/D%FCrst | |||
| 3) http://www.example.org/D<FC>rst | ||||
| 3) http://www.example.org/D%FCrst | ||||
| 4) http://www.example.org/D%FCrst | 4) http://www.example.org/D%FCrst | |||
| 5) http://www.example.org/D%FCrst | 5) http://www.example.org/D%FCrst | |||
| The following example contains '%e2%80%ae', which is the escaped UTF- | 6) http://www.example.org/D%FCrst | |||
| 8 encoding of U+202E, RIGHT-TO-LEFT OVERRIDE. Section 4.1 forbids | ||||
| the direct use of this character in an IRI. Therefore, the | The following example contains '%e2%80%ae', which is the escaped | |||
| corresponding octets are re-escaped in step 4). This example shows | UTF-8 encoding of U+202E, RIGHT-TO-LEFT OVERRIDE. Section 4.1 | |||
| forbids the direct use of this character in an IRI. Therefore, the | ||||
| corresponding octets are re-escaped in step 5). This example shows | ||||
| that the case (upper or lower) of letters used in escapes may not be | that the case (upper or lower) of letters used in escapes may not be | |||
| preserved. | preserved. The example also contains a punycode-encoded domain name | |||
| label (xn--99zt52a), which is converted to the corresponding | ||||
| characters U+7D0D U+8C46 (Japanese Natto). | ||||
| 1) http://www.example.org/%e2%80%ae | 1) http://xn--99zt52a.example.org/%e2%80%ae | |||
| 2) http://www.example.org/<E2><80><AE> | 2) http://<E7><B4><8D><E8><B1><86>.example.org/%e2%80%ae | |||
| 3) http://www.example.org/<E2><80><AE> | 3) http://<E7><B4><8D><E8><B1><86>.example.org/<E2><80><AE> | |||
| 4) http://www.example.org/%E2%80%AE | 4) http://<E7><B4><8D><E8><B1><86>.example.org/<E2><80><AE> | |||
| 5) http://www.example.org/%E2%80%AE | 5) http://<E7><B4><8D><E8><B1><86>.example.org/%E2%80%AE | |||
| | 6) http://&#x7D0D;&#x8C46;.example.org/%E2%80%AE | |||
| 4. Bidirectional IRIs for Right-to-left Languages | 4. Bidirectional IRIs for Right-to-left Languages | |||
| Some UCS characters, such as those used in the Arabic and Hebrew | Some UCS characters, such as those used in the Arabic and Hebrew | |||
| script, have an inherent right-to-left writing direction. IRIs | script, have an inherent right-to-left (rtl) writing direction. IRIs | |||
| containing such characters (called bidirectional IRIs or Bidi IRIs) | containing such characters (called bidirectional IRIs or Bidi IRIs) | |||
| require additional attention because of the non-trivial relation | require additional attention because of the non-trivial relation | |||
| between logical representation (used for digital representation as | between logical representation (used for digital representation as | |||
| well as when reading/spelling) and visual representation (used for | well as when reading/spelling) and visual representation (used for | |||
| display/printing). | display/printing). | |||
| Because of the complex interaction between the logical | Because of the complex interaction between the logical | |||
| representation, the visual representation, and the syntax of a Bidi | representation, the visual representation, and the syntax of a Bidi | |||
| IRI, a balance is needed between various requirements. The main | IRI, a balance is needed between various requirements. The main | |||
| requirements are (1) user-predictable conversion between visual and | requirements are: | |||
| logical representation; (2) the ability to include a wide range of | ||||
| characters in various parts of the IRI; (3) no or not too big changes | 1) user-predictable conversion between visual and logical | |||
| or restrictions for implementations. | representation; | |||
| 2) the ability to include a wide range of characters in various | ||||
| parts of the IRI; | ||||
| 3) no or not too big changes or restrictions for implementations. | ||||
| 4.1 Logical Storage and Visual Presentation | 4.1 Logical Storage and Visual Presentation | |||
| In their internal digital representation, i.e. stored or transmitted | When stored or transmitted in digital representation, bidirectional | |||
| for resolution, bidirectional IRIs MUST be in full logical order, and | IRIs MUST be in full logical order, and MUST conform to the IRI | |||
| MUST conform directly to the IRI syntax rules (which includes the | syntax rules (which includes the rules relevant to their scheme). | |||
| rules relevant to their scheme). This assures that bidirectional | This assures that bidirectional IRIs can be processed in the same way | |||
| IRIs can be processed in the same way as other IRIs. | as other IRIs. | |||
| When rendered, bidirectional IRIs MUST be rendered using the Unicode | When rendered, bidirectional IRIs MUST be rendered using the Unicode | |||
| Bidirectional Algorithm [UNIV3], [UNI9]. Bidirectional IRIs MUST be | Bidirectional Algorithm [UNIV4], [UNI9]. Bidirectional IRIs MUST be | |||
| rendered with an overall left-to-right direction. | rendered with an overall left-to-right (ltr) direction. | |||
| In text with a left-to-right base directionality or embedding (e.g | In text with a left-to-right base directionality or embedding (as | |||
| English, Cyrillic), the Unicode Bidirectional Algorithm will | used for e.g. English or Cyrillic), the Unicode Bidirectional | |||
| automatically use an overall left-to-right direction for the IRI. In | Algorithm will automatically use an overall ltr direction for the | |||
| text with a right-to-left base directionality or embedding (e.g. | IRI. In text with a rtl base directionality or embedding (as used | |||
| Arabic or Hebrew), some kind of embedding is needed. This may be | e.g. for Arabic or Hebrew), setting a different embedding direction | |||
| Unicode bidi formatting codes (LRE before the IRI, and PDF after the | for the IRI is needed. Setting the embedding direction can be done | |||
| IRI, both not part of the IRI itself) or equivalent features of a | in a higher-order protocol (e.g. the dir='ltr' attribute in HTML). | |||
| higher-order protocol (e.g. the dir='ltr' attribute in HTML). | If this is not available (e.g. in plain text), setting the embedding | |||
| is done with Unicode bidi formatting codes, i.e. U+202A, LEFT-TO- | ||||
| RIGHT EMBEDDING (LRE) before the IRI, and U+202C, POP DIRECTIONAL | ||||
| FORMATTING (PDF) after the IRI, both not being part of the IRI | ||||
| itself. | ||||
| IRIs MUST NOT contain bidirectional formatting characters (LRM, RLM, | IRIs MUST NOT contain bidirectional formatting characters (LRM, RLM, | |||
| LRE, RLE, LRO, RLO, and PDF). They affect the visual rendering of | LRE, RLE, LRO, RLO, and PDF). They affect the visual rendering of | |||
| the IRI, but do not itself appear visually. It would therefore not | the IRI, but do not themselves appear visually. It would therefore | |||
| be possible to again correctly input an IRI with such characters. | not be possible to correctly input an IRI with such characters. | |||
| 4.2 Bidi IRI Structure | 4.2 Bidi IRI Structure | |||
| The Unicode Bidirectional Algorithm is designed mainly for running | The Unicode Bidirectional Algorithm is designed mainly for running | |||
| text. To make sure that it does not affect the rendering of | text. To make sure that it does not affect the rendering of | |||
| bidirectional IRIs too much, some restrictions on bidirectional IRIs | bidirectional IRIs too much, some restrictions on bidirectional IRIs | |||
| are necessary. These restrictions are given in terms of delimiters | are necessary. These restrictions are given in terms of delimiters | |||
| (structural characters, mostly punctuation such as | (structural characters, mostly punctuation such as '@', '.', ':', | |||
| '@', '.', ':', '/') and components (usually consisting mostly of | '/') and components (usually consisting mostly of letters and | |||
| letters and digits). | digits). | |||
| The following syntax rules from Section 2.2 correspond to components | The following syntax rules from Section 2.2 correspond to components | |||
| for the purpose of Bidi behavior: iopaquepart, irelsegment, iregname, | for the purpose of Bidi behavior: iuserinfo, isegment, ihostname, | |||
| iuserinfo, isegment, iparam, ihostname, iquery, and ifragment. | iquery, and ifragment. | |||
| Specifications that define the syntax of any of the above components | Specifications that define the syntax of any of the above components | |||
| MAY divide them further and define smaller parts to be components | MAY divide them further and define smaller parts to be components | |||
| according to this document. As an example, the restrictions of | according to this document. As an example, the restrictions of | |||
| [RFCXXXX] on bidirectional domain names correspond to treating each | [RFC3490] on bidirectional domain names correspond to treating each | |||
| label of the domain name as a component. Even where the components | label of the domain name as a component. Even where the components | |||
| are not defined formally, it may be helpful to think about some | are not defined formally, it may be helpful to think about some | |||
| syntax in terms of components and to apply the relevant restrictions. | syntax in terms of components and to apply the relevant restrictions. | |||
| For example, for the usual name/value syntax in query parts, it is | For example, for the usual name/value syntax in query parts, it is | |||
| convenient to treat each name and each value as a component. As | convenient to treat each name and each value as a component. As | |||
| another example, the extensions in a resource name can be treated as | another example, the extensions in a resource name can be treated as | |||
| separate components. | separate components. | |||
| For each component, the following restrictions apply: | For each component, the following restrictions apply: | |||
| skipping to change at page 18, line 26 | skipping to change at page 17, line 33 | |||
| 2) A component using right-to-left characters SHOULD start and end | 2) A component using right-to-left characters SHOULD start and end | |||
| with right-to-left characters. | with right-to-left characters. | |||
| The above restrictions are given as shoulds, rather than as musts. | The above restrictions are given as shoulds, rather than as musts. | |||
| For IRIs that are never presented visually, they are not relevant. | For IRIs that are never presented visually, they are not relevant. | |||
| However, for IRIs in general, they are very important to ensure | However, for IRIs in general, they are very important to ensure | |||
| consistent conversion between visual presentation and logical | consistent conversion between visual presentation and logical | |||
| representation, in both directions. | representation, in both directions. | |||
| In some components, the above restrictions may actually be strictly | In some components, the above restrictions may actually be strictly | |||
| enforced. For example, [RFCXXXX] requires that these restrictions | enforced. For example, [RFC3490] requires that these restrictions | |||
| apply to the labels of the host name part of an IRI. In some other | apply to the labels of the host name part of an IRI. In some other | |||
| components, for example path components, following these restrictions | components, for example path components, following these restrictions | |||
| may not be too difficult. For other components, such as parts of the | may not be too difficult. For other components, such as parts of the | |||
| query part, it may be very difficult to enforce the restrictions, | query part, it may be very difficult to enforce the restrictions, | |||
| because the values of query parameters may be arbitrary character | because the values of query parameters may be arbitrary character | |||
| sequences. | sequences. | |||
| In order to satisfy the above restrictions, the affected component | If the above restrictions cannot be satisfied otherwise, the affected | |||
| can be mapped to URI notation as described in Section 3.1. Please | component can always be mapped to URI notation as described in | |||
| note that the whole component needs to be mapped (see also Example 9 | Section 3.1. Please note that the whole component needs to be mapped | |||
| below). | (see also Example 9 below). | |||
| 4.3 Input of Bidi IRIs | 4.3 Input of Bidi IRIs | |||
| Bidi input methods MUST generate Bidi IRIs in logical order while | Bidi input methods MUST generate Bidi IRIs in logical order while | |||
| rendering them according to Section 4.1. During input, rendering | rendering them according to Section 4.1. During input, rendering | |||
| should be updated after every new character that is input to avoid | SHOULD be updated after every new character that is input to avoid | |||
| end user confusion. | end user confusion. | |||
| 4.4 Examples | 4.4 Examples | |||
| This section gives examples of bidirectional IRIs, in Bidi Notation. | This section gives examples of bidirectional IRIs, in Bidi Notation. | |||
| It shows legal IRIs with the relationship between logical and visual | It shows legal IRIs with the relationship between logical and visual | |||
| representation, and explains how certain phenomena in this | representation, and explains how certain phenomena in this | |||
| relationship may look strange to somebody not familiar with | relationship may look strange to somebody not familiar with | |||
| bidirectional behavior, but familiar to users of Arabic and Hebrew. | bidirectional behavior, but familiar to users of Arabic and Hebrew. | |||
| It also shows what happens if the restrictions given in Section 4.2 | It also shows what happens if the restrictions given in Section 4.2 | |||
| are not followed. The examples below can be seen at [BidiEx], in | are not followed. The examples below can be seen at [BidiEx], in | |||
| Arabic, Hebrew, and Bidi Notation variants. | Arabic, Hebrew, and Bidi Notation variants. | |||
| Example 1: A single component with right-to-left (rtl) characters is | To read the bidi text in the examples, read the visual representation | |||
| inverted: | from left to right until you encounter a block of rtl text. Read the | |||
| logical representation: http://ab.CDEFGH.ij/kl/mn/op.html, | rtl block (including slashes and other special characters) from right | |||
| visual representation: http://ab.HGFEDC.ij/kl/mn/op.html. | to left, then continue at the next unread ltr character. | |||
| Example 1: A single component with rtl characters is inverted: | ||||
| logical representation: http://ab.CDEFGH.ij/kl/mn/op.html | ||||
| visual representation: http://ab.HGFEDC.ij/kl/mn/op.html | ||||
| Components can be read one-by-one, and each component can be read in | Components can be read one-by-one, and each component can be read in | |||
| its natural direction. | its natural direction. | |||
| Example 2: More than one consecutive component with rtl characters is | Example 2: More than one consecutive component with rtl characters is | |||
| inverted as a whole: | inverted as a whole: | |||
| logical representation: http://ab.CDE.FGH/ij/kl/mn/op.html, | logical representation: http://ab.CDE.FGH/ij/kl/mn/op.html | |||
| visual representation: http://ab.HGF.EDC/ij/kl/mn/op.html. | visual representation: http://ab.HGF.EDC/ij/kl/mn/op.html | |||
| A sequence of rtl components is read rtl, in the same way as a | A sequence of rtl components is read rtl, in the same way as a | |||
| sequence of rtl words is read rtl in a bidi text. | sequence of rtl words is read rtl in a bidi text. | |||
| Example 3: All components of an IRI (except for the scheme) are rtl. | Example 3: All components of an IRI (except for the scheme) are rtl. | |||
| All rtl components are inverted overall: | All rtl components are inverted overall: | |||
| logical representation: http://AB.CD.EF/GH/IJ/KL?MN=OP;QR=ST#UV, | logical representation: http://AB.CD.EF/GH/IJ/KL?MN=OP;QR=ST#UV | |||
| visual representation: http://VU#TS=RQ;PO=NM?LK/JI/HG/FE.DC.BA. | visual representation: http://VU#TS=RQ;PO=NM?LK/JI/HG/FE.DC.BA | |||
| The whole IRI (except the scheme) is read rtl. Delimiters between | The whole IRI (except the scheme) is read rtl. Delimiters between | |||
| rtl components stay between the respective components; delimiters | rtl components stay between the respective components; delimiters | |||
| between ltr and rtl components don't move. | between ltr and rtl components don't move. | |||
| Example 4: Several sequences of rtl components are each inverted on | Example 4: Several sequences of rtl components are each inverted on | |||
| their own: | their own: | |||
| logical representation: http://AB.CD.ef/gh/IJ/KL.html, | logical representation: http://AB.CD.ef/gh/IJ/KL.html | |||
| visual representation: http://DC.BA.ef/gh/LK/JI.html. | visual representation: http://DC.BA.ef/gh/LK/JI.html | |||
| Each sequence of rtl components is read rtl, in the same way as each | Each sequence of rtl components is read rtl, in the same way as each | |||
| sequence of rtl words in an ltr text is read rtl. | sequence of rtl words in an ltr text is read rtl. | |||
| Example 5: Example 2, applied to components of different kinds: | Example 5: Example 2, applied to components of different kinds: | |||
| logical representation: http://ab.cd.EF/GH/ij/kl.html, | logical representation: http://ab.cd.EF/GH/ij/kl.html | |||
| visual representation: http://ab.cd.HG/FE/ij/kl.html. | visual representation: http://ab.cd.HG/FE/ij/kl.html | |||
| The inversion of the domain name label and the path component may be | The inversion of the domain name label and the path component may be | |||
| unexpected, but is consistent with other bidi behavior. | unexpected, but is consistent with other bidi behavior. For | |||
| reassurance that the domain component really is "ab.cd.EF", it may be | ||||
| helpful to read aloud the visual representation following the bidi | ||||
| algorithm. After "http://ab.cd." one reads the RTL block "E-F-slash- | ||||
| G-H", which corresponds to the logical representation. | ||||
| Example 6: Same as example 5, with more rtl components: | Example 6: Same as example 5, with more rtl components: | |||
| logical representation: http://ab.CD.EF/GH/IJ/kl.html, | logical representation: http://ab.CD.EF/GH/IJ/kl.html | |||
| visual representation: http://ab.JI/HG/FE.DC/kl.html. | visual representation: http://ab.JI/HG/FE.DC/kl.html | |||
| The inversion of the domain name labels and the path components may | The inversion of the domain name labels and the path components may | |||
| be easier to identify because the delimiters also move. | be easier to identify because the delimiters also move. | |||
| Example 7: A single rtl component with included digits: | Example 7: A single rtl component with included digits: | |||
| logical representation: http://ab.CDE123FGH.ij/kl/mn/op.html, | logical representation: http://ab.CDE123FGH.ij/kl/mn/op.html | |||
| visual representation: http://ab.HGF123EDC.ij/kl/mn/op.html. | visual representation: http://ab.HGF123EDC.ij/kl/mn/op.html | |||
| Numbers are written ltr in all cases, but are treated as an | Numbers are written ltr in all cases, but are treated as an | |||
| additional embedding inside a run of rtl characters. This is | additional embedding inside a run of rtl characters. This is | |||
| completely consistent with usual bidirectional text. | completely consistent with usual bidirectional text. | |||
| Example 8 (not allowed): Numbers at the start or end of a rtl | Example 8 (not allowed): Numbers at the start or end of a rtl | |||
| component: | component: | |||
| logical representation: http://ab.cd.ef/GH1/2IJ/KL.html, | logical representation: http://ab.cd.ef/GH1/2IJ/KL.html | |||
| visual representation: http://ab.cd.ef/LK/JI1/2HG.html. | visual representation: http://ab.cd.ef/LK/JI1/2HG.html | |||
| The sequence '1/2' is interpreted by the bidi algorithm as a | The sequence '1/2' is interpreted by the bidi algorithm as a | |||
| fraction, fragmenting the components and leading to confusion. There | fraction, fragmenting the components and leading to confusion. There | |||
| are other characters that are interpreted in a special way close to | are other characters that are interpreted in a special way close to | |||
| numbers, in particular '+', '-', '#', '$', '%', ',', '.', and ':'. | numbers, in particular '+', '-', '#', '$', '%', ',', '.', and ':'. | |||
| Example 9 (not allowed): The numbers in the previous example are | Example 9 (not allowed): The numbers in the previous example are | |||
| escaped: | escaped: | |||
| logical representation: http://ab.cd.ef/GH%31/%32IJ/KL.html, | logical representation: http://ab.cd.ef/GH%31/%32IJ/KL.html, | |||
| visual representation (Hebrew): http://ab.cd.ef/LK/JI%32/%31HG.html, | visual representation (Hebrew): http://ab.cd.ef/LK/JI%32/%31HG.html | |||
| visual representation (Arabic): http://ab.cd.ef/LK/JI32%/31%HG.html. | visual representation (Arabic): http://ab.cd.ef/LK/JI32%/31%HG.html | |||
| Depending on whether the upper-case letters represent Arabic or | Depending on whether the upper-case letters represent Arabic or | |||
| Hebrew, the visual representation is different. | Hebrew, the visual representation is different. | |||
| 5. Use of IRIs | 5. IRI Equivalence and Comparison | |||
| 5.1 Limitations on UCS Characters Allowed in IRIs | This section discusses IRI Equivalence and Comparison similar to | |||
| Section 6, "Normalization and Comparison", in [RFCYYYY]. This | ||||
| section focusses on the main issues and on aspects that are different | ||||
| from [RFCYYYY]; Section 6 of [RFCYYYY] is recommended background | ||||
| reading. | ||||
| This section discusses the limitations on characters and character | There is no general rule or procedure to decide whether two arbitrary | |||
| IRIs are equivalent or not (i.e. whether they refer to the same | ||||
| resource or not). Two IRIs that look almost the same may refer to | ||||
| different resources. Two IRIs that look completely different may | ||||
| refer to the same resource. Each specification or application that | ||||
| uses IRIs has to decide on the appropriate criterion for IRI | ||||
| equivalence. | ||||
| 5.1 Simple String Comparison | ||||
| In some scenarios a definite answer to the question of IRI | ||||
| equivalence is needed that is independent of the scheme used and | ||||
| always can be calculated quickly and without accessing a network. An | ||||
| example of such a case is XML Namespaces ([XMLNamespace]). In such | ||||
| cases, two IRIs SHOULD be defined as equivalent if and only if they | ||||
| are character-by-character equivalent. This is the same as being | ||||
| byte-by-byte equivalent if the character encoding for both IRIs is | ||||
| the same. As an example, | ||||
| http://example.org/~user, http://example.org/%7euser, and | ||||
| http://example.org/%7Euser are not equivalent under this definition. | ||||
| In such a case, the comparison function MUST NOT map IRIs to URIs, | ||||
| because such a mapping would create additional spurious equivalences. | ||||
| It follows that IRIs SHOULD NOT be modified when being transported if | ||||
| there is any chance that this IRI might be used as an identifier in | ||||
| the way explained above. | ||||
| 5.2 Conversion to URIs | ||||
| For actual resolution, differences in escaping (except for the | ||||
| escaping of reserved characters) MUST always result in the same | ||||
| resource. For example, http://example.org/~user, | ||||
| http://example.org/%7euser and http://example.org/%7Euser must | ||||
| resolve to the same resource. | ||||
| If this kind of equivalence is to be tested, the escaping of both | ||||
| IRIs to be compared has to be aligned, for example by converting both | ||||
| IRIs to URIs (see Section 3.1) and making sure that the case of the | ||||
| hexadecimal characters in the %-escape is always the same (preferably | ||||
| upper case). For comparison, such conversions MUST only be done on | ||||
| the fly, while retaining the original IRI. | ||||
| Additional, similar equivalences are possible based on knowledge | ||||
| about the generic URI/IRI syntax, such as the fact that the scheme | ||||
| part is case-insensitive. | ||||
| 5.3 Normalization | ||||
| The Unicode Standard [UNIV4] defines various equivalences between | ||||
| sequences of characters for various purposes. Unicode Standard Annex | ||||
| #15 [UTR15] defines various Normalization Forms for these | ||||
| equivalences, in particular Normalization Form C (NFC, Canonical | ||||
| Decomposition, followed by Canonical Composition) and Normalization | ||||
| Form KC (NFKC, Compatibility Decomposition, followed by Canonical | ||||
| Composition). | ||||
| Equivalence of IRIs MUST rely on the assumption that IRIs are | ||||
| appropriately pre-normalized, rather than applying normalization when | ||||
| comparing two IRIs. The exceptions are conversion from a non- | ||||
| digital form, and conversion from a non-UCS-based encoding to an UCS- | ||||
| based encoding. In these cases, NFC or a normalizing transcoder | ||||
| using NFC MUST be used for interoperability. To avoid false | ||||
| negatives and problems with transcoding, IRIs SHOULD be created using | ||||
| NFC. Using NFKC will avoid even more problems. | ||||
| As an example, http://www.example.org/r&#xE9;sum&#xE9;.html (in XML | ||||
| Notation) is in NFC. On the other hand, http://www.example.org/ | ||||
| re&#x301;sume&#x301;.html is not in NFC. The former uses precombined | ||||
| e-acute characters, the latter uses 'e' characters followed by | ||||
| combining acute accents. Both usages are defined to be canonically | ||||
| equivalent in [UNIV4]. | ||||
| Because we do not know how a particular field is treated with respect | ||||
| to text normalization, it would be inappropriate to allow third | ||||
| parties to normalize an IRI arbitrarily. This does not contradict | ||||
| the recommendation that if you create a resource, and an IRI for that | ||||
| resource, you try to be as normalized as possible (i.e. NFKC if | ||||
| possible). This is similar to the upper-case/lower-case problems in | ||||
| URIs. Some parts of an URI are case-insensitive (domain name). For | ||||
| others, it is unclear whether they are case-sensitive or case- | ||||
| insensitive, or something in between (e.g. case-sensitive, but if | ||||
| you use the wrong case, may not directly get a result, but rather a | ||||
| 'Multiple choices'). The best recipe we have there is that the | ||||
| generator uses a reasonable capitalization, and when transferring the | ||||
| URI, you do not change capitalization. | ||||
| Various IRI schemes may allow the usage of International Domain Names | ||||
| (IDN) [RFC3490]. When in use in IRIs, those names SHOULD be | ||||
| validated using the ToASCII operation defined in [RFC3490], with the | ||||
| flags "UseSTD3ASCIIRules" and "AllowUnassigned". An IRI containing | ||||
| an invalid IDN cannot successfully be resolved. For legibility | ||||
| purposes, IDN components of IRIs SHOULD NOT be converted into ASCII | ||||
| Compatible Encoding (ACE). However, this conversion is applied when | ||||
| mapping an IRI into an URI, see Section 3.1. | ||||
| 5.4 Preferred Forms | ||||
| The following are the preferred forms for IRIs when generated: | ||||
| - Always provide the URI scheme in lowercase characters. | ||||
| - Only perform percent-escaping where it is essential. | ||||
| - Always use uppercase A-through-F characters when percent- | ||||
| escaping. | ||||
| - Always provide the hostname, if any, in the form produced when | ||||
| applying [RFC3491]. This in particular includes using | ||||
| lowercase characters rather than uppercase characters where | ||||
| applicable. | ||||
| - Where possible, provide IRI components in NFKC or NFC. | ||||
| - Prevent /./ and /../ from appearing in non-relative URI paths. | ||||
| 6. Use of IRIs | ||||
| 6.1 Limitations on UCS Characters Allowed in IRIs | ||||
| This section discusses limitations on characters and character | ||||
| sequences usable for IRIs. The considerations in this section are | sequences usable for IRIs. The considerations in this section are | |||
| relevant when creating IRIs and when converting from URIs to IRIs. | relevant when creating IRIs and when converting from URIs to IRIs. | |||
| a) The repertoire of characters allowed in each IRI component is | a) The repertoire of characters allowed in each IRI component is | |||
| limited by the definition of that component. For example, the | limited by the definition of that component. For example, the | |||
| definition of the scheme component does not allow characters | definition of the scheme component does not allow characters | |||
| beyond US-ASCII. | beyond US-ASCII. | |||
| (Note: In accordance with URI practice, generic IRI software | (Note: In accordance with URI practice, generic IRI software | |||
| cannot and should not check for such limitations.) | cannot and should not check for such limitations.) | |||
| b) In the URI syntax, characters that are likely to be used to | b) The UCS contains many areas of characters for which there are | |||
| delimit URIs in text and print ("space", "delims", and | ||||
| "unwise") were excluded. They are included in the IRI syntax | ||||
| (with the exception of '%', which cannot be used directly, and | ||||
| '#', which is used in IRI references), for the following | ||||
| reasons: | ||||
| 1) The syntax includes many other characters that are not | ||||
| appropriate in many cases. | ||||
| 2) Some implementation practice already allows them in URI | ||||
| references (for example spaces in fragment identifiers). | ||||
| 3) It is very convenient in some cases, for example for | ||||
| XPointers in XML attributes. | ||||
| 4) Considering context is already necessary in the case of | ||||
| URIs, for example for "&" in XML. | ||||
| However, these characters should be avoided where possible. | ||||
| Whenever there is a chance that an IRI will be used in a | ||||
| component where these characters can be harmful, they should be | ||||
| escaped from the start. | ||||
| c) The UCS contains many areas of characters for which there are | ||||
| strong visual look-alikes. Because of the likelihood of | strong visual look-alikes. Because of the likelihood of | |||
| transcription errors, these also should be avoided. This | transcription errors, these also should be avoided. This | |||
| includes the full-width equivalents of ASCII characters, half- | includes the full-width equivalents of ASCII characters, half- | |||
| width Katakana characters for Japanese, and many others. This | width Katakana characters for Japanese, and many others. This | |||
| also includes many look-alikes of "space", "delims", and | also includes many look-alikes of "space", "delims", and | |||
| "unwise", characters excluded in [RFC2396]. | "unwise", characters excluded in [RFC2396]. | |||
| Additional information is available from [UNIXML]. [UNIXML] is | Additional information is available from [UNIXML]. [UNIXML] is | |||
| written in the context of running text rather than in the context of | written in the context of running text rather than in the context of | |||
| identifiers. Nevertheless, it discusses many of the categories of | identifiers. Nevertheless, it discusses many of the categories of | |||
| characters and code points not appropriate for IRIs. | characters and code points not appropriate for IRIs. | |||
| 5.2 Software Interfaces and Protocols | 6.2 Software Interfaces and Protocols | |||
| Although an IRI is defined as a sequence of characters, software | Although an IRI is defined as a sequence of characters, software | |||
| interfaces for URIs typically function on sequences of octets or | interfaces for URIs typically function on sequences of octets or | |||
| other kinds of code units. Thus, software interfaces and protocols | other kinds of code units. Thus, software interfaces and protocols | |||
| MUST define which character encoding is used. | MUST define which character encoding is used. | |||
| Intermediate software interfaces between IRI-capable components and | Intermediate software interfaces between IRI-capable components and | |||
| URI-only components MUST map the IRIs as per Section 3.1, when | URI-only components MUST map the IRIs per Section 3.1, when | |||
| transferring from IRI-capable to URI-only components. Such a mapping | transferring from IRI-capable to URI-only components. Such a mapping | |||
| SHOULD be applied as late as possible. It should not be applied | SHOULD be applied as late as possible. It should not be applied | |||
| between components that are known to be able to handle IRIs. | between components that are known to be able to handle IRIs. | |||
| 5.3 Format of URIs and IRIs in Documents and Protocols | 6.3 Format of URIs and IRIs in Documents and Protocols | |||
| Document formats that transport URIs may need to be upgraded to allow | Document formats that transport URIs may need to be upgraded to allow | |||
| the transport of IRIs. In those cases where the document as a whole | the transport of IRIs. In those cases where the document as a whole | |||
| has a native character encoding, IRIs MUST also be encoded in this | has a native character encoding, IRIs MUST also be encoded in this | |||
| encoding, and converted accordingly by a parser or interpreter. IRI | encoding, and converted accordingly by a parser or interpreter. IRI | |||
| characters that are not expressible in the native encoding SHOULD be | characters that are not expressible in the native encoding SHOULD be | |||
| escaped using the escaping conventions of the document format if such | escaped using the escaping conventions of the document format if such | |||
| conventions are available. Alternatively, they MAY be escaped | conventions are available. Alternatively, they MAY be escaped | |||
| according to Section 3.1. For example, in HTML, XML, or SGML, | according to Section 3.1. For example, in HTML or XML, numeric | |||
| numeric character references should be used. If a document as a | character references SHOULD be used. If a document as a whole has a | |||
| whole has a native character encoding, and that character encoding is | native character encoding, and that character encoding is not UTF-8, | |||
| not UTF-8, then IRIs MUST NOT be placed into the document in the UTF- | then IRIs MUST NOT be placed into the document in the UTF-8 character | |||
| 8 character encoding. | encoding. | |||
| Note: Some formats already accommodate IRIs, although they use | Note: Some formats already accommodate IRIs, although they use | |||
| different terminology. HTML 4.0 [HTML4] defines the conversion from | different terminology. HTML 4.0 [HTML4] defines the conversion from | |||
| IRIs to URIs as error-avoiding behavior. XML 1.0 [XML1], XLink | IRIs to URIs as error-avoiding behavior. XML 1.0 [XML1], XLink | |||
| [XLink], and XML Schema [XMLSchema] and specifications based upon | [XLink], and XML Schema [XMLSchema] and specifications based upon | |||
| them allow IRIs. Also, it is expected that all relevant new W3C | them allow IRIs. Also, it is expected that all relevant new W3C | |||
| formats and protocols will be required to handle IRIs [CharMod]. | formats and protocols will be required to handle IRIs [CharMod]. | |||
| 5.4 Relative IRI References | 6.4 Use of UTF-8 for Encoding Original Characters | |||
| This section discusses details and gives examples for point c) in | ||||
| Section 1.2. In order to be able to use IRIs, the URI corresponding | ||||
| to the IRI in question has to encode original characters into octets | ||||
| using UTF-8. This can be specified for all URIs of an URI scheme, or | ||||
| can apply to individual URIs for schemes that do not specify how to | ||||
| encode original characters. It can apply to the whole URI, or only | ||||
| some part. | ||||
| For new URI schemes, using UTF-8 is recommended in [RFC2718]. | ||||
| Examples where this is already used are the URN syntax [RFC2141], | ||||
| IMAP URLs [RFC2192], and POP URLs [RFC2384]. On the other hand, the | ||||
| HTTP URL scheme does not specify how to encode original characters, | ||||
| and therefore IRIs only can be used for some HTTP URLs. | ||||
| For example, for a document with a URI of | ||||
| http://www.example.org/r%C3%A9sum%C3%A9.html, it is possible to | ||||
| construct a corresponding IRI (in XML notation, see Section 1.4): | ||||
| http://www.example.org/r&#xE9;sum&#xE9;.html (&#xE9; stands for the | ||||
| e-acute character, and %C3%A9 is the UTF-8 encoded and escaped | ||||
| representation of that character). On the other hand, for a document | ||||
| with an URI of http://www.example.org/r%E9sum%E9.html, the escaped | ||||
| octets cannot be converted to actual characters in an IRI, because | ||||
| the escaping is not based on UTF-8. | ||||
| The requirement for the use of UTF-8 applies to all parts of an URI, | ||||
| with the exception of the ihostname part. However, it is possible | ||||
| that the capability of IRIs to represent a wide range of characters | ||||
| directly is used just in some parts of the IRI (or IRI reference). | ||||
| The other parts of the IRI may only contain ASCII characters, or they | ||||
| may not be based on UTF-8. They may be based on another encoding, or | ||||
| they may directly encode raw binary data (see also [RFC2397]). | ||||
| For example, it is possible to have an URI reference of | ||||
| http://www.example.org/r%E9sum%E9.xml#r%C3%A9sum%C3%A9, where the | ||||
| document name is encoded in iso-8859-1 based on server settings, but | ||||
| the fragment identifier is encoded in UTF-8 according to [XPointer]. | ||||
| The IRI corresponding to the above URI would be (in XML notation) | ||||
| http://www.example.org/r%E9sum%E9.xml#r&#xE9;sum&#xE9;. | ||||
| @@@@ add something about query parts | ||||
| 6.5 Relative IRI References | ||||
| Processing of relative forms of IRIs against a base is handled | Processing of relative forms of IRIs against a base is handled | |||
| straightforwardly; the algorithms of RFC 2396 may be applied | straightforwardly; the algorithms of [RFCYYYY] can be applied | |||
| directly, treating the characters additionally allowed in IRIs in the | directly, treating the characters additionally allowed in IRIs in the | |||
| same way as unreserved characters in URIs. | same way as unreserved characters in URIs. | |||
| 6. URI/IRI Processing Guidelines (informative) | 7. URI/IRI Processing Guidelines (informative) | |||
| This informative section provides guidelines for supporting IRIs in | This informative section provides guidelines for supporting IRIs in | |||
| the same software components and operations that currently process | the same software components and operations that currently process | |||
| URIs: software interfaces that handle URIs, software that allows | URIs: software interfaces that handle URIs, software that allows | |||
| users to enter URIs, software that generates URIs, software that | users to enter URIs, software that generates URIs, software that | |||
| displays URIs, formats and protocols that transport URIs, and | displays URIs, formats and protocols that transport URIs, and | |||
| software that interprets URIs. These may all require more or less | software that interprets URIs. These may all require more or less | |||
| modification before functioning properly with IRIs. The | modification before functioning properly with IRIs. The | |||
| considerations in this section also apply to URI references and IRI | considerations in this section also apply to URI references and IRI | |||
| references. | references. | |||
| 6.1 URI/IRI Software Interfaces | 7.1 URI/IRI Software Interfaces | |||
| Software interfaces that handle URIs, such as URI-handling APIs and | Software interfaces that handle URIs, such as URI-handling APIs and | |||
| protocols transferring URIs, need interfaces and protocol elements | protocols transferring URIs, need interfaces and protocol elements | |||
| that are designed to carry IRIs. | that are designed to carry IRIs. | |||
| In case the current handling in an API or protocol is based on US- | In case the current handling in an API or protocol is based on US- | |||
| ASCII, UTF-8 is recommended as the encoding for IRIs, because this is | ASCII, UTF-8 is recommended as the encoding for IRIs, because this is | |||
| compatible with US-ASCII, is in accordance with the recommendations | compatible with US-ASCII, is in accordance with the recommendations | |||
| of [RFC2277], and makes it easy to convert to URIs where necessary. | of [RFC2277], and makes it easy to convert to URIs where necessary. | |||
| In any case, the encoding used must not be left undefined. | In any case, the API or protocol definition must clearly define the | |||
| encoding to be used. | ||||
| The transfer from URI-only to IRI-capable components requires no | The transfer from URI-only to IRI-capable components requires no | |||
| mapping, although the conversion described in Section 3.2 above may | mapping, although the conversion described in Section 3.2 above may | |||
| be performed. It is preferable not to perform this inverse | be performed. It is preferable not to perform this inverse | |||
| conversion when there is a chance that this cannot be done correctly. | conversion when there is a chance that this cannot be done correctly. | |||
| 6.2 URI/IRI Entry | 7.2 URI/IRI Entry | |||
| There are components that allow users to enter URIs into the system, | There are components that allow users to enter URIs into the system, | |||
| for example, by typing or dictation. This software must be updated | for example by typing or dictation. This software must be updated to | |||
| to allow for IRI entry. | allow for IRI entry. | |||
| A person viewing a visual representation of an IRI (as a sequence of | A person viewing a visual representation of an IRI (as a sequence of | |||
| glyphs, in some order, in some visual display) or hearing an IRI, | glyphs, in some order, in some visual display) or hearing an IRI, | |||
| will use an entry method for characters in the user's language to | will use an entry method for characters in the user's language to | |||
| input the IRI. Depending on the script and the input method used, | input the IRI. Depending on the script and the input method used, | |||
| this may be a more or less complicated process. | this may be a more or less complicated process. | |||
| The process of IRI entry must assure, as far as possible, that the | The process of IRI entry must assure, as far as possible, that the | |||
| restrictions defined in Section 2.2 are met. This may be done by | restrictions defined in Section 2.2 are met. This may be done by | |||
| choosing appropriate input methods or variants/settings thereof, by | choosing appropriate input methods or variants/settings thereof, by | |||
| appropriately converting the characters being input, by eliminating | appropriately converting the characters being input, by eliminating | |||
| characters that cannot be converted, and/or by issuing a warning or | characters that cannot be converted, and/or by issuing a warning or | |||
| error message to the user. | error message to the user. | |||
| As an example of variant settings, input method editors for East | As an example of variant settings, input method editors for East | |||
| Asian Languages usually allow to input Latin letters and related | Asian Languages usually allow the input of Latin letters and related | |||
| characters in full-width or half-width versions. For IRI input, the | characters in full-width or half-width versions. For IRI input, the | |||
| input method editor should be set to half-width input, in order to | input method editor should be set to half-width input, in order to | |||
| produce US-ASCII characters where possible. | produce US-ASCII characters where possible. | |||
| An input field primarily or only used for the input of URIs/IRIs | An input field primarily or only used for the input of URIs/IRIs | |||
| should allow the user to view an IRI as mapped to a URI. Places | should allow the user to view an IRI as mapped to a URI. Places | |||
| where the input of IRIs is frequent should provide the possibility | where the input of IRIs is frequent should provide the possibility | |||
| for viewing an IRI as mapped to a URI. This will help users when | for viewing an IRI as mapped to a URI. This will help users when | |||
| some of the software they use does not yet accept IRIs. | some of the software they use does not yet accept IRIs. | |||
| An IRI input component that interfaces to components that handle | An IRI input component that interfaces to components that handle | |||
| URIs, but not IRIs, must map the the IRI to an URI before passing it | URIs, but not IRIs, must map the IRI to a URI before passing it to | |||
| to such a component. | such a component. | |||
| For the input of IRIs with right-to-left characters, please see | For the input of IRIs with right-to-left characters, please see | |||
| Section 4.3. | Section 4.3. | |||
| 6.3 URI/IRI Transfer Between Applications | 7.3 URI/IRI Transfer Between Applications | |||
| Many applications, in particular many mail user agents, try to detect | Many applications, in particular many mail user agents, try to detect | |||
| URIs appearing in plain text. For this, they use some heuristics | URIs appearing in plain text. For this, they use some heuristics | |||
| based on URI syntax. They then allow the user to click on such URIs | based on URI syntax. They then allow the user to click on such URIs | |||
| and retrieve the corresponding resource in an appropriate (usually | and retrieve the corresponding resource in an appropriate (usually | |||
| scheme-dependent) application. | scheme-dependent) application. | |||
| Such applications have to be upgraded to use the IRI syntax rather | Such applications have to be upgraded to use the IRI syntax rather | |||
| than the URI syntax as a base for heuristics. In particular, a non- | than the URI syntax as a base for heuristics. In particular, a non- | |||
| ASCII character should not be taken as the indication of the end of | ASCII character should not be taken as the indication of the end of | |||
| skipping to change at page 24, line 20 | skipping to change at page 26, line 32 | |||
| application where the IRI appears to the encoding used by the system- | application where the IRI appears to the encoding used by the system- | |||
| wide IRI invocation mechanism, or to an URI (according to Section | wide IRI invocation mechanism, or to an URI (according to Section | |||
| 3.1) if the system-wide invocation mechanism only accepts URIs. | 3.1) if the system-wide invocation mechanism only accepts URIs. | |||
| The clipboard is another frequently used way to transfer URIs and | The clipboard is another frequently used way to transfer URIs and | |||
| IRIs from one application to another. On most platforms, the | IRIs from one application to another. On most platforms, the | |||
| clipboard is able to store and transfer text in many languages and | clipboard is able to store and transfer text in many languages and | |||
| scripts. Correctly used, the clipboard transfers characters, not | scripts. Correctly used, the clipboard transfers characters, not | |||
| bytes, which will do the right thing with IRIs. | bytes, which will do the right thing with IRIs. | |||
| 6.4 URI/IRI Generation | 7.4 URI/IRI Generation | |||
| Systems that are offering resources through the Internet, where those | Systems that offer resources through the Internet, where those | |||
| resources have logical names, sometimes automatically generate URIs | resources have logical names, sometimes automatically generate URIs | |||
| for the resources they offer. For example, some HTTP servers can | for the resources they offer. For example, some HTTP servers can | |||
| generate a directory listing for a file directory, and then respond | generate a directory listing for a file directory, and then respond | |||
| to the generated URIs with the files. | to the generated URIs with the files. | |||
| Many legacy character encodings are in use in various file systems. | Many legacy character encodings are in use in various file systems. | |||
| Many currently deployed systems do not transform the local character | Many currently deployed systems do not transform the local character | |||
| representation of the underlying system before generating URIs. | representation of the underlying system before generating URIs. | |||
| For maximum interoperability, systems that generate resource | For maximum interoperability, systems that generate resource | |||
| identifiers should do the appropriate transformations. They should | identifiers should do the appropriate transformations. For example, | |||
| use IRIs converted to URIs in cases where it cannot be expected that | if a file system contains a file named r&#xE9;sum&#xE9;.html, a | |||
| the recipient is able to handle IRIs. Due to the way most user | server should expose this as r%C3%A9sum%C3%A9.html in an URI, which | |||
| agents currently work, native IRIs, encoded in UTF-8, may be used if | allows to use r&#xE9;sum&#xE9;.html in an IRI, even if the file name | |||
| the recipient announces that it can interpret UTF-8. This requires | locally is kept in an encoding other than UTF-8. | |||
| that the whole page is sent as UTF-8. If this is not possible, | ||||
| escaping can always be used. | ||||
| This recommendation in particular applies to HTTP servers. For FTP | This recommendation in particular applies to HTTP servers. For FTP | |||
| servers, similar considerations apply, see in particular [RFC2640]. | servers, similar considerations apply, see in particular [RFC2640]. | |||
| 6.5 URI/IRI Selection | 7.5 URI/IRI Selection | |||
| In some cases, resource owners and publishers have control over the | In some cases, resource owners and publishers have control over the | |||
| IRIs used to identify their resources. Such control is mostly | IRIs used to identify their resources. Such control is mostly | |||
| executed by controlling the resource names, such as file names, | executed by controlling the resource names, such as file names, | |||
| directly. | directly. | |||
| In such cases, it is recommended to avoid choosing IRIs that are | In such cases, it is recommended to avoid choosing IRIs that are | |||
| easily confused. For example, for US-ASCII, the lower-case ell "l" | easily confused. For example, for US-ASCII, the lower-case ell "l" | |||
| is easily confused with the digit one "1", and the upper-case oh "O" | is easily confused with the digit one "1", and the upper-case oh "O" | |||
| is easily confused with the digit zero "0". Publishers should avoid | is easily confused with the digit zero "0". Publishers should avoid | |||
| skipping to change at page 25, line 38 | skipping to change at page 27, line 49 | |||
| Greek 'Alpha', and the Cyrillic 'A'. To avoid such cases, only IRIs | Greek 'Alpha', and the Cyrillic 'A'. To avoid such cases, only IRIs | |||
| should be generated where all the characters in a single component | should be generated where all the characters in a single component | |||
| are used together in a given language. This usually means that all | are used together in a given language. This usually means that all | |||
| these characters will be from the same script, but there are | these characters will be from the same script, but there are | |||
| languages that mix characters from different scripts (such as | languages that mix characters from different scripts (such as | |||
| Japanese). This is similar to the heuristics used to distinguish | Japanese). This is similar to the heuristics used to distinguish | |||
| between letters and numbers in the examples above. Also, for Latin, | between letters and numbers in the examples above. Also, for Latin, | |||
| Greek, and Cyrillic, using lower-case letters results in fewer | Greek, and Cyrillic, using lower-case letters results in fewer | |||
| ambiguities than using upper-case letters. | ambiguities than using upper-case letters. | |||
| 6.6 Display of URIs/IRIs | 7.6 Display of URIs/IRIs | |||
| In situations where the rendering software is not expected to display | In situations where the rendering software is not expected to display | |||
| non-ASCII parts of the IRI correctly using the available layout and | non-ASCII parts of the IRI correctly using the available layout and | |||
| font resources, these parts should be escaped before being displayed. | font resources, these parts should be escaped before being displayed. | |||
| For display of Bidi IRIs, please see Section 4.1. | For display of Bidi IRIs, please see Section 4.1. | |||
| 6.7 Interpretation of URIs and IRIs | 7.7 Interpretation of URIs and IRIs | |||
| Software that interprets IRIs as the names of local resources should | Software that interprets IRIs as the names of local resources should | |||
| accept IRIs in multiple forms, and convert and match them with the | accept IRIs in multiple forms, and convert and match them with the | |||
| appropriate local resource names. | appropriate local resource names. | |||
| First, multiple representations include both IRIs in the native | First, multiple representations include both IRIs in the native | |||
| character encoding of the protocol and also their URI counterparts. | character encoding of the protocol and also their URI counterparts. | |||
| Second, it may include URIs constructed based on other character | Second, it may include URIs constructed based on other character | |||
| encodings than UTF-8. Such URIs may be produced by user agents that | encodings than UTF-8. Such URIs may be produced by user agents that | |||
| skipping to change at page 26, line 33 | skipping to change at page 28, line 43 | |||
| the accents on received IRIs or resource names where appropriate. | the accents on received IRIs or resource names where appropriate. | |||
| Please note that such mappings, including case mappings, are | Please note that such mappings, including case mappings, are | |||
| language-dependent. | language-dependent. | |||
| It can be difficult to unambiguously identify a resource if too many | It can be difficult to unambiguously identify a resource if too many | |||
| mappings are taken into consideration. However, escaped and non- | mappings are taken into consideration. However, escaped and non- | |||
| escaped parts of IRIs can always clearly be distinguished. Also, the | escaped parts of IRIs can always clearly be distinguished. Also, the | |||
| regularity of UTF-8 (see [Duerst97]) makes the potential for | regularity of UTF-8 (see [Duerst97]) makes the potential for | |||
| collisions lower than it may seem at first sight. | collisions lower than it may seem at first sight. | |||
| 6.8 Upgrading Strategy | 7.8 Upgrading Strategy | |||
| Where this recommendation places further constraints on software for | Where this recommendation places further constraints on software for | |||
| which many instances are already deployed, it is important to | which many instances are already deployed, it is important to | |||
| introduce upgrades carefully, and to be aware of the various | introduce upgrades carefully, and to be aware of the various | |||
| interdependencies. | interdependencies. | |||
| If IRIs cannot be interpreted correctly, they should not be generated | If IRIs cannot be interpreted correctly, they should not be generated | |||
| or transported. This suggests that upgrading URI interpreting | or transported. This suggests that upgrading URI interpreting | |||
| software to accept IRIs should have highest priority. | software to accept IRIs should have highest priority. | |||
| skipping to change at page 27, line 19 | skipping to change at page 29, line 30 | |||
| is known to transport them safely. | is known to transport them safely. | |||
| Display software should be upgraded only after upgraded entry | Display software should be upgraded only after upgraded entry | |||
| software has been widely deployed to the population that will see the | software has been widely deployed to the population that will see the | |||
| displayed result. | displayed result. | |||
| These recommendations, when taken together, will allow for the | These recommendations, when taken together, will allow for the | |||
| extension from URIs to IRIs in order to handle scripts other than | extension from URIs to IRIs in order to handle scripts other than | |||
| ASCII while minimizing interoperability problems. | ASCII while minimizing interoperability problems. | |||
| 7. Security Considerations | 8. Security Considerations | |||
| Incorrect escaping or unescaping can lead to security problems. In | Incorrect escaping or unescaping can lead to security problems. In | |||
| particular, some UTF-8 decoders do not check against overlong byte | particular, some UTF-8 decoders do not check against overlong byte | |||
| sequences. As an example, a '/' is encoded with the byte 0x2F both | sequences. As an example, a '/' is encoded with the byte 0x2F both | |||
| in UTF-8 and in ASCII, but some UTF-8 decoders also wrongly interpret | in UTF-8 and in ASCII, but some UTF-8 decoders also wrongly interpret | |||
| the sequence 0xC0 0xAF as a '/'. A sequence such as '%C0%AF..' may | the sequence 0xC0 0xAF as a '/'. A sequence such as '%C0%AF..' may | |||
| pass some security tests and then be interpreted as '/..' in a path | pass some security tests and then be interpreted as '/..' in a path | |||
| if UTF-8 decoders are fault-tolerant, if conversion and checking are | if UTF-8 decoders are fault-tolerant, if conversion and checking are | |||
| not done in the right order, and/or if reserved characters and | not done in the right order, and/or if reserved characters and | |||
| unreserved characters are not clearly distinguished. | unreserved characters are not clearly distinguished. | |||
| skipping to change at page 27, line 41 | skipping to change at page 30, line 4 | |||
| There are various ways in which "spoofing" can occur with IRIs. | There are various ways in which "spoofing" can occur with IRIs. | |||
| "Spoofing" means that somebody may add a resource name that looks the | "Spoofing" means that somebody may add a resource name that looks the | |||
| same or similar to the user, but points to a different resource. The | same or similar to the user, but points to a different resource. The | |||
| added resource may pretend to be the real resource by looking very | added resource may pretend to be the real resource by looking very | |||
| similar, but may contain all kinds of changes that may be difficult | similar, but may contain all kinds of changes that may be difficult | |||
| to spot but can cause all kinds of problems. Most spoofing | to spot but can cause all kinds of problems. Most spoofing | |||
| possibilities for IRIs are extensions of those for URIs. | possibilities for IRIs are extensions of those for URIs. | |||
| Spoofing can occur for various reasons. A first reason is that | Spoofing can occur for various reasons. A first reason is that | |||
| normalization expectations of a user or actual normalization when | normalization expectations of a user or actual normalization when | |||
| entering an IRI do not match the normalization used on the server | entering an IRI, or when transcoding an IRI from a legacy encoding, | |||
| side. Conceptually, this is no different from the problems | do not match the normalization used on the server side. | |||
| surrounding the use of case-insensitive web servers. For example, a | Conceptually, this is no different from the problems surrounding the | |||
| popular web page with a mixed case name (http://big.site/ | use of case-insensitive web servers. For example, a popular web page | |||
| PopularPage.html) might be "spoofed" by someone who is able to create | with a mixed case name (http://big.site/PopularPage.html) might be | |||
| http://big.site/popularpage.html. However, the introduction of | "spoofed" by someone who is able to create http://big.site/ | |||
| character normalization, and of additional mappings for user | popularpage.html. However, the introduction of character | |||
| convenience, may increase the chance for spoofing. | normalization, and of additional mappings for user convenience, may | |||
| increase the chance for spoofing. Protocols and servers that allow | ||||
| the creation of resources with unnormalized names, and resources with | ||||
| names that are not normalized, are particularly vulnerable to such | ||||
| attacks. This is an inherent security problem of the relevant | ||||
| protocol, server, or resource, and not specific to IRIs, but | ||||
| mentioned here for completeness. | ||||
| Spoofing can occur because in the UCS, there are many characters that | Spoofing can occur because in the UCS, there are many characters that | |||
| look very similar. Details are discussed in Section 6.5. Again, | look very similar. Details are discussed in Section 7.5. Again, | |||
| this is very similar to spoofing possibilities on US-ASCII, e.g. | this is very similar to spoofing possibilities on US-ASCII, e.g. | |||
| using 'br0ken' or '1ame' URIs. | using 'br0ken' or '1ame' URIs. | |||
| Spoofing can occur when URIs in various encodings are accepted to | Spoofing can occur when URIs in various encodings are accepted to | |||
| deal with older user agents. In some cases, in particular for Latin- | deal with older user agents. In some cases, in particular for Latin- | |||
| based resource names, this is usually easy to detect because UTF-8- | based resource names, this is usually easy to detect because UTF-8- | |||
| encoded names, when interpreted and viewed as legacy encodings, | encoded names, when interpreted and viewed as legacy encodings, | |||
| produce mostly garbage. In other cases, when concurrently used | produce mostly garbage. In other cases, when concurrently used | |||
| encodings have a similar structure, but there are no characters that | encodings have a similar structure, but there are no characters that | |||
| have exactly the same encoding, detection is more difficult. | have exactly the same encoding, detection is more difficult. | |||
| Spoofing can occur in various IRI components, such as the domain name | Spoofing can occur in various IRI components, such as the domain name | |||
| part or a path part. For considerations specific to the domain name | part or a path part. For considerations specific to the domain name | |||
| part, see [Nameprep]. For the path part, administrators of sites | part, see [RFC3491]. For the path part, administrators of sites | |||
| which allow independent users to create resources in the same subarea | which allow independent users to create resources in the same subarea | |||
| may need to be careful to check for spoofing. | may need to be careful to check for spoofing. | |||
| Spoofing can occur with bidirectional IRIs, if the restrictions in | Spoofing can occur with bidirectional IRIs, if the restrictions in | |||
| Section 4.2 are not followed. The same visual representation may be | Section 4.2 are not followed. The same visual representation may be | |||
| interpreted as different logical representations, and vice versa. It | interpreted as different logical representations, and vice versa. It | |||
| is also very important that a correct Unicode bidirectional | is also very important that a correct Unicode bidirectional | |||
| implementation is used. | implementation is used. | |||
| 8. Issues List | 9. Acknowledgements | |||
| - Should characters in iadditional be allowed? Under what | ||||
| conditions? | ||||
| - Align the description in Section 2.3 with the results of W3C | ||||
| TAG discussions on issue URIEquivalence. | ||||
| - Adapt depending on how [IDNURI] is integrated into | ||||
| [RFC2396bis]. | ||||
| 9. Change log | ||||
| 9.1 Changes from -02 to -03 | ||||
| - Added an issues list. | ||||
| - Added a paragraph prohibiting conversions from URIs to IRIs not | ||||
| based on UTF-8 to Section 3.2. | ||||
| - Introduced iadditional to combine unwise, delims, and space. | ||||
| - Tweaked description and added examples for URI-to-IRI | ||||
| conversion. | ||||
| - Improved syntax rules for hostname part. | ||||
| - Improved description of equivalences in Section 2.3. | ||||
| - Improved description of URI-to-IRI-mapping in Section 3.2. | ||||
| - Changed preferred case when hex-escaping from lower to UPPER. | ||||
| - Fixed various details. | ||||
| 9.2 Changes from -01 to -02 | ||||
| - New approach for Bidi section, many examples. | ||||
| - Created idelims, removed '%' and '#'. Changed userinfo to | ||||
| iuserinfo in iserver. | ||||
| - Changed to ABNF defined by [RFC2234]. | ||||
| - Included bug fixes from [RFC2396bis]. | ||||
| - Additions to Acknowledgements. | ||||
| 9.3 Changes from -00 to -01 | ||||
| - Re-integrated the section on Bidi, some issues left. | ||||
| - Integrated IDN, changed syntax (host, userinfo,....). | ||||
| - Moved some text around, marked some as informational. | ||||
| - Made a clear distinction of IRI use for identification only and | ||||
| for resource resolution. | ||||
| - Fixed various details in wording, spelling,... | ||||
| 10. Acknowledgements | ||||
| We would like to thank Larry Masinter for his work as coauthor of | We would like to thank Larry Masinter for his work as coauthor of | |||
| many earlier versions of this document (draft-masinter-url-i18n-xx). | many earlier versions of this document (draft-masinter-url-i18n-xx). | |||
| The discussion on the issue addressed here has started a long time | The discussion on the issue addressed here has started a long time | |||
| ago. There was a thread in the HTML working group in August 1995 | ago. There was a thread in the HTML working group in August 1995 | |||
| (under the topic of "Globalizing URIs") and in the www-international | (under the topic of "Globalizing URIs") and in the www-international | |||
| mailing list in July 1996 (under the topic of "Internationalization | mailing list in July 1996 (under the topic of "Internationalization | |||
| and URLs"), and ad-hoc meetings at the Unicode conferences in | and URLs"), and ad-hoc meetings at the Unicode conferences in | |||
| September 1995 and September 1997. | September 1995 and September 1997. | |||
| Thanks to Francois Yergeau, Matti Allouche, Roy Fielding, Tim | Thanks to Francois Yergeau, Matti Allouche, Roy Fielding, Tim | |||
| Berners-Lee, Mark Davis, M.T. Carrasco Benitez, James Clark, Tim | Berners-Lee, Mark Davis, M.T. Carrasco Benitez, James Clark, Tim | |||
| Bray, Chris Wendt, Yaron Goland, Andrea Vine, Misha Wolf, Leslie | Bray, Chris Wendt, Yaron Goland, Andrea Vine, Misha Wolf, Leslie | |||
| Daigle, Ted Hardie, Makoto MURATA, Steven Atkin, Ryan Stansifer, Tex | Daigle, Ted Hardie, Makoto MURATA, Steven Atkin, Ryan Stansifer, Tex | |||
| Texin, Graham Klyne, Bjoern Hoehrmann, Chris Lilley, Dan Oscarson, | Texin, Graham Klyne, Bjoern Hoehrmann, Chris Lilley, Ian Jacobs, Dan | |||
| Elliotte Rusty Harold, Mike J. Brown, Carlos Viegas Damasio, and | Oscarson, Elliotte Rusty Harold, Mike J. Brown, Simon Josefsson, | |||
| many others for help with understanding the issues and possible | Carlos Viegas Damasio, and many others for help with understanding | |||
| solutions, and getting the details right. Thanks also to the members | the issues and possible solutions, and getting the details right. | |||
| of the W3C I18N Working Group and Interest Group for their | Thanks also to the members of the W3C I18N Working Group and Interest | |||
| contributions and their work on [CharMod], to the members of many | Group for their contributions and their work on [CharMod], to the | |||
| other W3C WGs for adopting the ideas, and to the members of the | members of many other W3C WGs for adopting the ideas, and to the | |||
| Montreal IAB Workshop on Internationalization and Localization for | members of the Montreal IAB Workshop on Internationalization and | |||
| their review. | Localization for their review. | |||
| Normative References | Normative References | |||
| [ISO10646] International Organization for Standardization, | [ISO10646] International Organization for Standardization, | |||
| "Information Technology - Universal Multiple-Octet Coded | "Information Technology - Universal Multiple-Octet Coded | |||
| Character Set (UCS) - Part 1: Architecture and Basic | Character Set (UCS) - Part 1: Architecture and Basic | |||
| Multilingual Plane - Part 2: Supplementary Planes", ISO | Multilingual Plane - Part 2: Supplementary Planes", ISO | |||
| Standard 10646, with amendment, July 2002. | Standard 10646, with amendment, July 2002. | |||
| [RFC2234] Crocker, D. and P. Overell, "Augmented BNF for Syntax | [RFC2234] Crocker, D. and P. Overell, "Augmented BNF for Syntax | |||
| Specifications: ABNF", RFC 2234, November 1997. | Specifications: ABNF", RFC 2234, November 1997. | |||
| [RFC2279] Yergeau, F., "UTF-8, a transformation format of ISO | [RFC3490] Faltstrom, P., Hoffman, P. and A. Costello, | |||
| 10646", RFC 2279, January 1998. | "Internationalizing Domain Names in Applications (IDNA)", | |||
| RFC 3490, March 2003, <http://www.ietf.org/rfc/ | ||||
| rfc3490.txt>. | ||||
| [RFC2396] Berners-Lee, T., Fielding, R. and L. Masinter, "Uniform | [RFC3491] Hoffman, P. and M. Blanchet, "Nameprep: A Stringprep | |||
| Resource Identifiers (URI): Generic Syntax", RFC 2396, | Profile for Internationalized Domain Names (IDN)", RFC | |||
| August 1998. | 3491, March 2003. | |||
| [RFC2732] Hinden, R., Carpenter, B. and L. Masinter, "Format for | [RFCXXXX] Yergeau, F., "UTF-8, a transformation format of ISO | |||
| Literal IPv6 Addresses in URL's", RFC 2732, December | 10646", draft-yergeau-rfc2279bis-05.txt (work in | |||
| 1999. | progress), June 2003, <http://www.ietf.org/internet- | |||
| drafts/draft-yergeau-rfc2279bis-05.txt>. | ||||
| [RFCXXXX] Faltstrom, P., Hoffman, P. and A. Costello, | [RFCYYYY] Berners-Lee, T., Fielding, R. and L. Masinter, "Uniform | |||
| "Internationalizing Domain Names in Applications (IDNA)", | Resource Identifier (URI): Generic Syntax", draft- | |||
| draft-ietf-idn-idna-14.txt (work in progress), October | fielding-uri-rfc2396bis-03.txt (work in progress), June | |||
| 2002, <http://www.ietf.org/internet-drafts/draft-ietf- | 2003. | |||
| idn-idna-14.txt>. | ||||
| [UTR15] Davis, M. and M. Duerst, "Unicode Normalization Forms", | [UTR15] Davis, M. and M. Duerst, "Unicode Normalization Forms", | |||
| Unicode Standard Annex #15, March 2001, <http:// | Unicode Standard Annex #15, March 2001, <http:// | |||
| www.unicode.org/unicode/reports/tr15/tr15-21.html>. | www.unicode.org/unicode/reports/tr15/tr15-21.html>. | |||
| Non-normative References | Non-normative References | |||
| [BidiEx] "Examples of bidirectional IRIs", <http://www.w3.org/ | [BidiEx] "Examples of bidirectional IRIs", <http://www.w3.org/ | |||
| International/iri-edit/BidiExamples>. | International/iri-edit/BidiExamples>. | |||
| skipping to change at page 31, line 31 | skipping to change at page 32, line 35 | |||
| From Specification to Testing", Proc. 19th | From Specification to Testing", Proc. 19th | |||
| International Unicode Conference, San Jose , | International Unicode Conference, San Jose , | |||
| September 2001, <http://www.w3.org/2001/Talks/0912- | September 2001, <http://www.w3.org/2001/Talks/0912- | |||
| IUC-IRI/paper.html>. | IUC-IRI/paper.html>. | |||
| [HTML4] Raggett, D., Le Hors, A. and I. Jacobs, "HTML 4.01 | [HTML4] Raggett, D., Le Hors, A. and I. Jacobs, "HTML 4.01 | |||
| Specification", World Wide Web Consortium | Specification", World Wide Web Consortium | |||
| Recommendation, December 1999, <http://www.w3.org/TR/ | Recommendation, December 1999, <http://www.w3.org/TR/ | |||
| REC-html40/appendix/notes.html#h-B.2>. | REC-html40/appendix/notes.html#h-B.2>. | |||
| [IDNURI] Duerst, M., "Internationalized Domain Names in URIs", | ||||
| draft-ietf-idn-uri-03.txt (work in progress), | ||||
| November 2002, <http://www.ietf.org/internet-drafts/ | ||||
| draft-ietf-idn-uri-03.txt>. | ||||
| [Nameprep] Hoffman, P. and M. Blanchet, "Nameprep: A Stringprep | ||||
| Profile for Internationalized Domain Names", draft- | ||||
| ietf-idn-nameprep-11.txt (work in progress), June | ||||
| 2002, <http://www.ietf.org/internet-drafts/draft- | ||||
| ietf-idn-nameprep-11.txt>. | ||||
| [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate | [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate | |||
| Requirement Levels", BCP 14, RFC 2119, March 1997. | Requirement Levels", BCP 14, RFC 2119, March 1997. | |||
| [RFC2130] Weider, C., Preston, C., Simonsen, K., Alvestrand, | [RFC2130] Weider, C., Preston, C., Simonsen, K., Alvestrand, | |||
| H., Atkinson, R., Crispin, M. and P. Svanberg, "The | H., Atkinson, R., Crispin, M. and P. Svanberg, "The | |||
| Report of the IAB Character Set Workshop held 29 | Report of the IAB Character Set Workshop held 29 | |||
| February - 1 March, 1996", RFC 2130, April 1997. | February - 1 March, 1996", RFC 2130, April 1997. | |||
| [RFC2141] Moats, R., "URN Syntax", RFC 2141, May 1997. | [RFC2141] Moats, R., "URN Syntax", RFC 2141, May 1997. | |||
| [RFC2192] Newman, C., "IMAP URL Scheme", RFC 2192, September | [RFC2192] Newman, C., "IMAP URL Scheme", RFC 2192, September | |||
| 1997. | 1997. | |||
| [RFC2277] Alvestrand, H., "IETF Policy on Character Sets and | [RFC2277] Alvestrand, H., "IETF Policy on Character Sets and | |||
| Languages", BCP 18, RFC 2277, January 1998. | Languages", BCP 18, RFC 2277, January 1998. | |||
| [RFC2384] Gellens, R., "POP URL Scheme", RFC 2384, August 1998. | [RFC2384] Gellens, R., "POP URL Scheme", RFC 2384, August 1998. | |||
| [RFC2396bis] Berners-Lee, T., Fielding, R. and L. Masinter, | [RFC2396] Berners-Lee, T., Fielding, R. and L. Masinter, | |||
| "Uniform Resource Identifier (URI): Generic Syntax", | "Uniform Resource Identifiers (URI): Generic Syntax", | |||
| Internet-Draft (work in progress), October 2002. | RFC 2396, August 1998. | |||
| [RFC2397] Masinter, L., "The "data" URL scheme", RFC 2397, | [RFC2397] Masinter, L., "The "data" URL scheme", RFC 2397, | |||
| August 1998. | August 1998. | |||
| [RFC2616] Fielding, R., Gettys, J., Mogul, J., Nielsen, H., | [RFC2616] Fielding, R., Gettys, J., Mogul, J., Nielsen, H., | |||
| Masinter, L., Leach, P. and T. Berners-Lee, | Masinter, L., Leach, P. and T. Berners-Lee, | |||
| "Hypertext Transfer Protocol -- HTTP/1.1", RFC 2616, | "Hypertext Transfer Protocol -- HTTP/1.1", RFC 2616, | |||
| June 1999. | June 1999. | |||
| [RFC2640] Curtin, B., "Internationalization of the File | [RFC2640] Curtin, B., "Internationalization of the File | |||
| Transfer Protocol", RFC 2640, July 1999. | Transfer Protocol", RFC 2640, July 1999. | |||
| [RFC2718] Masinter, L., Alvestrand, H., Zigmond, D. and R. | [RFC2718] Masinter, L., Alvestrand, H., Zigmond, D. and R. | |||
| Petke, "Guidelines for new URL Schemes", RFC 2718, | Petke, "Guidelines for new URL Schemes", RFC 2718, | |||
| November 1999. | November 1999. | |||
| [UNIV3] The Unicode Consortium, "The Unicode Standard Version | [UNIV4] The Unicode Consortium, "The Unicode Standard, | |||
| 3.0", Addison-Wesley, Reading, MA , 2000. | Version 4.0", Addison-Wesley, Reading, MA , 2003. | |||
| [UNI9] Davis, M., "The Bidirectional Algorithm", Unicode | [UNI9] Davis, M., "The Bidirectional Algorithm", Unicode | |||
| Standard Annex #9, March 2002, <http:// | Standard Annex #9, March 2002, <http:// | |||
| www.unicode.org/unicode/reports/tr9>. | www.unicode.org/unicode/reports/tr9>. | |||
| [UNIXML] Duerst, M. and A. Freytag, "Unicode in XML and other | [UNIXML] Duerst, M. and A. Freytag, "Unicode in XML and other | |||
| Markup Languages", Unicode Technical Report #20, | Markup Languages", Unicode Technical Report #20, | |||
| World Wide Web Consortium Note, February 2002, | World Wide Web Consortium Note, February 2002, | |||
| <http://www.w3.org/TR/unicode-xml/>. | <http://www.w3.org/TR/unicode-xml/>. | |||
| skipping to change at page 33, line 21 | skipping to change at page 34, line 14 | |||
| [XMLNamespace] Bray, T., Hollander, D. and A. Layman, "Namespaces in | [XMLNamespace] Bray, T., Hollander, D. and A. Layman, "Namespaces in | |||
| XML", World Wide Web Consortium Recommendation, | XML", World Wide Web Consortium Recommendation, | |||
| January 1999, <http://www.w3.org/TR/REC-xml#sec- | January 1999, <http://www.w3.org/TR/REC-xml#sec- | |||
| external-ent>. | external-ent>. | |||
| [XMLSchema] Biron, P. and A. Malhotra, "XML Schema Part 2: | [XMLSchema] Biron, P. and A. Malhotra, "XML Schema Part 2: | |||
| Datatypes", World Wide Web Consortium Recommendation, | Datatypes", World Wide Web Consortium Recommendation, | |||
| May 2001, <http://www.w3.org/TR/xmlschema-2/#anyURI>. | May 2001, <http://www.w3.org/TR/xmlschema-2/#anyURI>. | |||
| [XPointer] Grosso, P., Maler, E., Marsh, J. and N. Walsh, | ||||
| "XPointer Framework", World Wide Web Consortium | ||||
| Recommendation, March 2003, <http://www.w3.org/TR/ | ||||
| xptr-framework/#escaping>. | ||||
| Authors' Addresses | Authors' Addresses | |||
| Martin Duerst (Note: Please write "Duerst" with u-umlaut wherever | Martin Duerst (Note: Please write "Duerst" with u-umlaut wherever | |||
| possible, for example as "Dürst" in XML and HTML.) | possible, for example as "Dürst" in XML and HTML.) | |||
| World Wide Web Consortium | World Wide Web Consortium | |||
| 200 Technology Square | 200 Technology Square | |||
| Cambridge, MA 02139 | Cambridge, MA 02139 | |||
| U.S.A. | U.S.A. | |||
| Phone: +1 617 253 5509 | Phone: +1 617 253 5509 | |||