| draft-duerst-iri-07.txt | draft-duerst-iri-08.txt | |||
|---|---|---|---|---|
| Network Working Group M. Duerst | Network Working Group M. Duerst | |||
| Internet-Draft W3C | Internet-Draft W3C | |||
| Expires: November 7, 2004 M. Suignard | Expires: November 26, 2004 M. Suignard | |||
| Microsoft Corporation | Microsoft Corporation | |||
| May 9, 2004 | May 28, 2004 | |||
| Internationalized Resource Identifiers (IRIs) | Internationalized Resource Identifiers (IRIs) | |||
| draft-duerst-iri-07 | draft-duerst-iri-08 | |||
| Status of this Memo | Status of this Memo | |||
| By submitting this Internet-Draft, I certify that any applicable | By submitting this Internet-Draft, I certify that any applicable | |||
| patent or other IPR claims of which I am aware have been disclosed, | patent or other IPR claims of which I am aware have been disclosed, | |||
| and any of which I become aware will be disclosed, in accordance with | and any of which I become aware will be disclosed, in accordance with | |||
| RFC 3668. | RFC 3668. | |||
| Internet-Drafts are working documents of the Internet Engineering | Internet-Drafts are working documents of the Internet Engineering | |||
| Task Force (IETF), its areas, and its working groups. Note that other | Task Force (IETF), its areas, and its working groups. Note that | |||
| groups may also distribute working documents as Internet-Drafts. | other groups may also distribute working documents as | |||
| | Internet-Drafts. | |||
| Internet-Drafts are draft documents valid for a maximum of six months | Internet-Drafts are draft documents valid for a maximum of six months | |||
| and may be updated, replaced, or obsoleted by other documents at any | and may be updated, replaced, or obsoleted by other documents at any | |||
| time. It is inappropriate to use Internet-Drafts as reference | time. It is inappropriate to use Internet-Drafts as reference | |||
| material or to cite them other than as "work in progress." | material or to cite them other than as "work in progress." | |||
| The list of current Internet-Drafts can be accessed at http:// | The list of current Internet-Drafts can be accessed at | |||
| www.ietf.org/ietf/1id-abstracts.txt. | http://www.ietf.org/ietf/1id-abstracts.txt. | |||
| The list of Internet-Draft Shadow Directories can be accessed at | The list of Internet-Draft Shadow Directories can be accessed at | |||
| http://www.ietf.org/shadow.html. | http://www.ietf.org/shadow.html. | |||
| This Internet-Draft will expire on November 7, 2004. | This Internet-Draft will expire on November 26, 2004. | |||
| Copyright Notice | Copyright Notice | |||
| Copyright (C) The Internet Society (2004). All Rights Reserved. | Copyright (C) The Internet Society (2004). All Rights Reserved. | |||
| Abstract | Abstract | |||
| This document defines a new protocol element, the Internationalized | This document defines a new protocol element, the Internationalized | |||
| Resource Identifier (IRI), as a complement to the Uniform Resource | Resource Identifier (IRI), as a complement to the Uniform Resource | |||
| Identifier (URI). An IRI is a sequence of characters from the | Identifier (URI). An IRI is a sequence of characters from the | |||
| skipping to change at page 2, line 9 | skipping to change at page 2, line 10 | |||
| URIs is defined, which means that IRIs can be used instead of URIs | URIs is defined, which means that IRIs can be used instead of URIs | |||
| where appropriate to identify resources. | where appropriate to identify resources. | |||
| The approach of defining a new protocol element was chosen, instead | The approach of defining a new protocol element was chosen, instead | |||
| of extending or changing the definition of URIs, to allow a clear | of extending or changing the definition of URIs, to allow a clear | |||
| distinction and to avoid incompatibilities with existing software. | distinction and to avoid incompatibilities with existing software. | |||
| Guidelines for the use and deployment of IRIs in various protocols, | Guidelines for the use and deployment of IRIs in various protocols, | |||
| formats, and software components that now deal with URIs are | formats, and software components that now deal with URIs are | |||
| provided. | provided. | |||
| Editorial Note | ||||
| This document is a product of the Internationalization Working Group | ||||
| (I18N WG) of the World Wide Web Consortium (W3C). For general | ||||
| discussion, please use the public-iri@w3.org mailing list (publicly | ||||
| archived at http://lists.w3.org/Archives/Public/public-iri/). An | ||||
| issues list for this document is maintained at http://www.w3.org/ | ||||
| International/iri-edit#issues. For more information on the topic of | ||||
| this document, please also see [W3CIRI] and [Duerst01]. | ||||
| Table of Contents | Table of Contents | |||
| 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 4 | 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 4 | |||
| 1.1 Overview and Motivation . . . . . . . . . . . . . . . . . 4 | 1.1 Overview and Motivation . . . . . . . . . . . . . . . . . 4 | |||
| 1.2 Applicability . . . . . . . . . . . . . . . . . . . . . . 4 | 1.2 Applicability . . . . . . . . . . . . . . . . . . . . . . 5 | |||
| 1.3 Definitions . . . . . . . . . . . . . . . . . . . . . . . 5 | 1.3 Definitions . . . . . . . . . . . . . . . . . . . . . . . 5 | |||
| 1.4 Notation . . . . . . . . . . . . . . . . . . . . . . . . . 6 | 1.4 Notation . . . . . . . . . . . . . . . . . . . . . . . . . 6 | |||
| 2. IRI Syntax . . . . . . . . . . . . . . . . . . . . . . . . . . 7 | 2. IRI Syntax . . . . . . . . . . . . . . . . . . . . . . . . . . 7 | |||
| 2.1 Summary of IRI Syntax . . . . . . . . . . . . . . . . . . 7 | 2.1 Summary of IRI Syntax . . . . . . . . . . . . . . . . . . 7 | |||
| 2.2 ABNF for IRI References and IRIs . . . . . . . . . . . . . 8 | 2.2 ABNF for IRI References and IRIs . . . . . . . . . . . . . 8 | |||
| 3. Relationship between IRIs and URIs . . . . . . . . . . . . . . 10 | 3. Relationship between IRIs and URIs . . . . . . . . . . . . . . 11 | |||
| 3.1 Mapping of IRIs to URIs . . . . . . . . . . . . . . . . . 10 | 3.1 Mapping of IRIs to URIs . . . . . . . . . . . . . . . . . 11 | |||
| 3.2 Converting URIs to IRIs . . . . . . . . . . . . . . . . . 13 | 3.2 Converting URIs to IRIs . . . . . . . . . . . . . . . . . 14 | |||
| 3.2.1 Examples . . . . . . . . . . . . . . . . . . . . . . . 15 | 3.2.1 Examples . . . . . . . . . . . . . . . . . . . . . . . 16 | |||
| 4. Bidirectional IRIs for Right-to-left Languages . . . . . . . . 16 | 4. Bidirectional IRIs for Right-to-left Languages . . . . . . . . 17 | |||
| 4.1 Logical Storage and Visual Presentation . . . . . . . . . 17 | 4.1 Logical Storage and Visual Presentation . . . . . . . . . 17 | |||
| 4.2 Bidi IRI Structure . . . . . . . . . . . . . . . . . . . . 18 | 4.2 Bidi IRI Structure . . . . . . . . . . . . . . . . . . . . 19 | |||
| 4.3 Input of Bidi IRIs . . . . . . . . . . . . . . . . . . . . 19 | 4.3 Input of Bidi IRIs . . . . . . . . . . . . . . . . . . . . 20 | |||
| 4.4 Examples . . . . . . . . . . . . . . . . . . . . . . . . . 19 | 4.4 Examples . . . . . . . . . . . . . . . . . . . . . . . . . 20 | |||
| 5. IRI Equivalence and Comparison . . . . . . . . . . . . . . . . 21 | 5. IRI Equivalence and Comparison . . . . . . . . . . . . . . . . 22 | |||
| 5.1 Simple String Comparison . . . . . . . . . . . . . . . . . 21 | 5.1 Simple String Comparison . . . . . . . . . . . . . . . . . 22 | |||
| 5.2 Conversion to URIs . . . . . . . . . . . . . . . . . . . . 22 | 5.2 Conversion to URIs . . . . . . . . . . . . . . . . . . . . 23 | |||
| 5.3 Normalization . . . . . . . . . . . . . . . . . . . . . . 22 | 5.3 Normalization . . . . . . . . . . . . . . . . . . . . . . 23 | |||
| 5.4 Preferred Forms . . . . . . . . . . . . . . . . . . . . . 23 | 5.4 Preferred Forms . . . . . . . . . . . . . . . . . . . . . 24 | |||
| 6. Use of IRIs . . . . . . . . . . . . . . . . . . . . . . . . . 24 | 6. Use of IRIs . . . . . . . . . . . . . . . . . . . . . . . . . 25 | |||
| 6.1 Limitations on UCS Characters Allowed in IRIs . . . . . . 24 | 6.1 Limitations on UCS Characters Allowed in IRIs . . . . . . 25 | |||
| 6.2 Software Interfaces and Protocols . . . . . . . . . . . . 24 | 6.2 Software Interfaces and Protocols . . . . . . . . . . . . 25 | |||
| 6.3 Format of URIs and IRIs in Documents and Protocols . . . . 25 | 6.3 Format of URIs and IRIs in Documents and Protocols . . . . 26 | |||
| 6.4 Use of UTF-8 for Encoding Original Characters . . . . . . 25 | 6.4 Use of UTF-8 for Encoding Original Characters . . . . . . 26 | |||
| 6.5 Relative IRI References . . . . . . . . . . . . . . . . . 26 | 6.5 Relative IRI References . . . . . . . . . . . . . . . . . 27 | |||
| 7. URI/IRI Processing Guidelines (informative) . . . . . . . . . 26 | 7. URI/IRI Processing Guidelines (informative) . . . . . . . . . 27 | |||
| 7.1 URI/IRI Software Interfaces . . . . . . . . . . . . . . . 26 | 7.1 URI/IRI Software Interfaces . . . . . . . . . . . . . . . 27 | |||
| 7.2 URI/IRI Entry . . . . . . . . . . . . . . . . . . . . . . 27 | 7.2 URI/IRI Entry . . . . . . . . . . . . . . . . . . . . . . 28 | |||
| 7.3 URI/IRI Transfer Between Applications . . . . . . . . . . 28 | 7.3 URI/IRI Transfer Between Applications . . . . . . . . . . 29 | |||
| 7.4 URI/IRI Generation . . . . . . . . . . . . . . . . . . . . 28 | 7.4 URI/IRI Generation . . . . . . . . . . . . . . . . . . . . 29 | |||
| 7.5 URI/IRI Selection . . . . . . . . . . . . . . . . . . . . 29 | 7.5 URI/IRI Selection . . . . . . . . . . . . . . . . . . . . 30 | |||
| 7.6 Display of URIs/IRIs . . . . . . . . . . . . . . . . . . . 29 | 7.6 Display of URIs/IRIs . . . . . . . . . . . . . . . . . . . 30 | |||
| 7.7 Interpretation of URIs and IRIs . . . . . . . . . . . . . 30 | 7.7 Interpretation of URIs and IRIs . . . . . . . . . . . . . 31 | |||
| 7.8 Upgrading Strategy . . . . . . . . . . . . . . . . . . . . 30 | 7.8 Upgrading Strategy . . . . . . . . . . . . . . . . . . . . 31 | |||
| 8. Security Considerations . . . . . . . . . . . . . . . . . . . 31 | 8. Security Considerations . . . . . . . . . . . . . . . . . . . 32 | |||
| 9. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . 33 | 9. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 34 | |||
| 10. References . . . . . . . . . . . . . . . . . . . . . . . . . 33 | 10. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . 34 | |||
| 10.1 Normative References . . . . . . . . . . . . . . . . . . . . 33 | 11. References . . . . . . . . . . . . . . . . . . . . . . . . . 34 | |||
| 10.2 Non-normative References . . . . . . . . . . . . . . . . . . 34 | 11.1 Normative References . . . . . . . . . . . . . . . . . . . . 34 | |||
| Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . 36 | 11.2 Non-normative References . . . . . . . . . . . . . . . . . . 35 | |||
| A. Design Alternatives . . . . . . . . . . . . . . . . . . . . . 37 | Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . 38 | |||
| A.1 New Scheme(s) . . . . . . . . . . . . . . . . . . . . . . 37 | A. Design Alternatives . . . . . . . . . . . . . . . . . . . . . 38 | |||
| A.2 Other Character Encodings than UTF-8 . . . . . . . . . . . 37 | A.1 New Scheme(s) . . . . . . . . . . . . . . . . . . . . . . 38 | |||
| A.3 New Encoding Convention . . . . . . . . . . . . . . . . . 38 | A.2 Other Character Encodings than UTF-8 . . . . . . . . . . . 39 | |||
| A.4 Indicating Character Encodings in the URI/IRI . . . . . . 38 | A.3 New Encoding Convention . . . . . . . . . . . . . . . . . 39 | |||
| Intellectual Property and Copyright Statements . . . . . . . . 39 | A.4 Indicating Character Encodings in the URI/IRI . . . . . . 39 | |||
| | Intellectual Property and Copyright Statements . . . . . . . . 40 | |||
| 1. Introduction | 1. Introduction | |||
| 1.1 Overview and Motivation | 1.1 Overview and Motivation | |||
| A URI is defined in [RFCYYYY] as a sequence of characters chosen from | A Uniform Resource Identifier (URI) is defined in [RFCYYYY] as a | |||
| a limited subset of the repertoire of US-ASCII characters. | sequence of characters chosen from a limited subset of the repertoire | |||
| | of US-ASCII [ASCII] characters. | |||
| The characters in URIs are frequently used for representing words of | The characters in URIs are frequently used for representing words of | |||
| natural languages. Such usage has many advantages: such URIs are | natural languages. Such usage has many advantages: such URIs are | |||
| easier to memorize, easier to interpret, easier to transcribe, easier | easier to memorize, easier to interpret, easier to transcribe, easier | |||
| to create, and easier to guess. For most languages other than | to create, and easier to guess. For most languages other than | |||
| English, however, the natural script uses characters other than A-Z. | English, however, the natural script uses characters other than A-Z. | |||
| For many people, handling Latin characters is as difficult as | For many people, handling Latin characters is as difficult as | |||
| handling the characters of other scripts is for people who use only | handling the characters of other scripts is for people who use only | |||
| the Latin alphabet. Many languages with non-Latin scripts have | the Latin alphabet. Many languages with non-Latin scripts have | |||
| transcriptions to Latin letters. Such transcriptions are now often | transcriptions to Latin letters. Such transcriptions are now often | |||
| used in URIs, but they introduce additional ambiguities. | used in URIs, but they introduce additional ambiguities. | |||
| The infrastructure for the appropriate handling of characters from | The infrastructure for the appropriate handling of characters from | |||
| local scripts is now widely deployed in local versions of operating | local scripts is now widely deployed in local versions of operating | |||
| system and application software. Software that can handle a wide | system and application software. Software that can handle a wide | |||
| variety of scripts and languages at the same time is increasingly | variety of scripts and languages at the same time is increasingly | |||
| widespread. Also, there are increasing numbers of protocols and | widespread. Also, there are increasing numbers of protocols and | |||
| formats that can carry a wide range of characters. | formats that can carry a wide range of characters. | |||
| This document defines a new protocol element, called IRI | This document defines a new protocol element, called | |||
| (Internationalized Resource Identifier), by extending the syntax of | Internationalized Resource Identifier (IRI), by extending the syntax | |||
| URIs to a much wider repertoire of characters. It also defines | of URIs to a much wider repertoire of characters. It also defines | |||
| "internationalized" versions corresponding to other constructs from | "internationalized" versions corresponding to other constructs from | |||
| [RFCYYYY], such as URI references. | [RFCYYYY], such as URI references. The syntax of IRIs is defined in | |||
| | Section 2, and the relationship between IRIs and URIs in Section 3. | |||
| Using characters outside of A-Z in IRIs brings with it some | Using characters outside of A-Z in IRIs brings with it some | |||
| difficulties; a discussion of potential problems and workarounds can | difficulties. Section 4 discusses the special case of bidirectional | |||
| be found in the later sections of this document. | IRIs, Section 5 various forms of equivalence between IRIs, and | |||
| | Section 6 the use of IRIs in different situations. Section 7 gives | |||
| | additional informative guidelines, and Section 8 security | |||
| | considerations. | |||
| | For discussion of this document, please use the public-iri@w3.org | |||
| | mailing list (publicly archived at | |||
| | http://lists.w3.org/Archives/Public/public-iri/). An issues list for | |||
| | this document is maintained at | |||
| | http://www.w3.org/International/iri-edit#issues. For more | |||
| | information on the topic of this document, please also see [W3CIRI] | |||
| | and [Duerst01]. | |||
| 1.2 Applicability | 1.2 Applicability | |||
| IRIs are designed to be compatible with recent recommendations for | IRIs are designed to be compatible with recent recommendations for | |||
| new URI schemes [RFC2718]. The compatibility is provided by | new URI schemes [RFC2718]. The compatibility is provided by | |||
| specifying a well defined and deterministic mapping from the IRI | specifying a well defined and deterministic mapping from the IRI | |||
| character sequence to the functionally equivalent URI character | character sequence to the functionally equivalent URI character | |||
| sequence. Practical use of IRIs (or IRI references) in place of URIs | sequence. Practical use of IRIs (or IRI references) in place of URIs | |||
| (or URI references) depends on the following conditions being met: | (or URI references) depends on the following conditions being met: | |||
| a) The protocol or format element used should be explicitly | a) The protocol or format element where IRIs are used should be | |||
| designated to carry IRIs. That is, the intent is not to introduce | explicitly designated to be able to carry IRIs. That is, the | |||
| IRIs into contexts that are not defined to accept them. For | intent is not to introduce IRIs into contexts that are not defined | |||
| example, XML schema [XMLSchema] has an explicit type "anyURI" that | to accept them. For example, XML schema [XMLSchema] has an | |||
| designates the use of IRIs. | explicit type "anyURI" that includes IRIs and IRI references. | |||
| | Therefore, IRIs and IRI references can be in attributes and | |||
| | elements of type "anyURI". On the other hand, in the HTTP | |||
| | protocol [RFC2616], the Request URI is defined as an URI, which | |||
| | means that direct use of IRIs is not allowed in HTTP requests. | |||
| b) The protocol or format carrying the IRIs should have a mechanism | b) The protocol or format carrying the IRIs should have a mechanism | |||
| to represent the wide range of characters used in IRIs, either | to represent the wide range of characters used in IRIs, either | |||
| natively or by some protocol- or format-specific escaping | natively or by some protocol- or format-specific escaping | |||
| mechanism (for example numeric character references in [XML1]). | mechanism (for example numeric character references in [XML1]). | |||
| c) The URI corresponding to the IRI in question has to encode | c) The URI corresponding to the IRI in question has to encode | |||
| original characters into octets using UTF-8. For new URI schemes, | original characters into octets using UTF-8. For new URI schemes, | |||
| this is recommended in [RFC2718]. It can apply to a whole scheme | this is recommended in [RFC2718]. It can apply to a whole scheme | |||
| (e.g. IMAP URLs [RFC2192] and POP URLs [RFC2384], or the URN | (e.g. IMAP URLs [RFC2192] and POP URLs [RFC2384], or the URN | |||
| skipping to change at page 5, line 49 | skipping to change at page 6, line 20 | |||
| (unambiguously) converting a sequence of octets into a sequence of | (unambiguously) converting a sequence of octets into a sequence of | |||
| characters. | characters. | |||
| charset: The name of a parameter or attribute used to identify a | charset: The name of a parameter or attribute used to identify a | |||
| character encoding. | character encoding. | |||
| UCS: Universal Character Set; the coded character set defined by ISO/ | UCS: Universal Character Set; the coded character set defined by ISO/ | |||
| IEC 10646 [ISO10646] and the Unicode Standard [UNIV4]. | IEC 10646 [ISO10646] and the Unicode Standard [UNIV4]. | |||
| IRI reference: The term "IRI reference" denotes the common usage of | IRI reference: The term "IRI reference" denotes the common usage of | |||
| an internationalized resource identifier. An IRI reference may be | an Internationalized Resource Identifier. An IRI reference may be | |||
| absolute or relative. However, the "IRI" that results from such a | absolute or relative. However, the "IRI" that results from such a | |||
| reference only includes absolute IRIs; any relative IRIs are | reference only includes absolute IRIs; any relative IRIs are | |||
| resolved to their absolute form. Note that in [RFC2396], URIs did | resolved to their absolute form. Note that in [RFC2396], URIs did | |||
| not include fragment identifiers, but in [RFCYYYY], fragment | not include fragment identifiers, but in [RFCYYYY], fragment | |||
| identifiers are part of URIs. | identifiers are part of URIs. | |||
| running text: Human text (paragraphs, sentences, phrases) with syntax | running text: Human text (paragraphs, sentences, phrases) with syntax | |||
| according to orthographic conventions of a natural language, as | according to orthographic conventions of a natural language, as | |||
| opposed to syntax defined for ease of processing by machines | opposed to syntax defined for ease of processing by machines | |||
| (markup, programming languages,...). | (markup, programming languages,...). | |||
| protocol element: Any portion of a message which affects processing | protocol element: Any portion of a message which affects processing | |||
| of that message by the protocol in question. | of that message by the protocol in question. | |||
| presentation element: Presentation form corresponding to a protocol | presentation element: Presentation form corresponding to a protocol | |||
| element, for example using a wider range of characters. | element, for example using a wider range of characters. | |||
| create (an URI or IRI): With respect to URIs and IRIs, the word | create (an URI or IRI): With respect to URIs and IRIs, the word | |||
| 'create' is used for the initial creation. This may be the initial | 'create' is used for the initial creation. This may be the | |||
| creation of a resource with a certain name, or the initial | initial creation of a resource with a certain name, or the initial | |||
| exposition of a resource under a particular name. | exposition of a resource under a particular name. | |||
| generate (an URI or IRI): With respect to URIs and IRIs, the word | generate (an URI or IRI): With respect to URIs and IRIs, the word | |||
| 'generate' is used when the IRI is generated by derivation from | 'generate' is used when the IRI is generated by derivation from | |||
| other information. | other information. | |||
| 1.4 Notation | 1.4 Notation | |||
| RFCs and Internet Drafts currently do not allow any characters | RFCs and Internet Drafts currently do not allow any characters | |||
| outside the US-ASCII repertoire. Therefore, this document uses | outside the US-ASCII repertoire. Therefore, this document uses | |||
| skipping to change at page 6, line 46 | skipping to change at page 7, line 17 | |||
| using a prefix of 'U+', followed by four to six hexadecimal digits. | using a prefix of 'U+', followed by four to six hexadecimal digits. | |||
| To represent characters outside US-ASCII in examples, this document | To represent characters outside US-ASCII in examples, this document | |||
| uses two notations called 'XML Notation' and 'Bidi Notation'. | uses two notations called 'XML Notation' and 'Bidi Notation'. | |||
| XML Notation uses leading '&#x', trailing ';', and the hexadecimal | XML Notation uses leading '&#x', trailing ';', and the hexadecimal | |||
| number of the character in the UCS in between. Example: &#x44F; | number of the character in the UCS in between. Example: &#x44F; | |||
| stands for CYRILLIC CAPITAL LETTER YA. In this notation, an actual | stands for CYRILLIC CAPITAL LETTER YA. In this notation, an actual | |||
| '&' is denoted by '&amp;'. | '&' is denoted by '&amp;'. | |||
| Bidi Notation is used for bidirectional examples: lower case ASCII | Bidi Notation is used for bidirectional examples: lower case letters | |||
| letters stand for Latin letters or other letters that are written | stand for Latin letters or other letters that are written | |||
| left-to-right, whereas upper case letters represent Arabic or Hebrew | left-to-right, whereas upper case letters represent Arabic or Hebrew | |||
| letters that are written right-to-left. | letters that are written right-to-left. | |||
| To denote actual octets in examples (as opposed to percent-encoded | To denote actual octets in examples (as opposed to percent-encoded | |||
| octets), the two hex digits denoting the octet are enclosed in "<" | octets), the two hex digits denoting the octet are enclosed in "<" | |||
| and ">". For example, the octet often denoted as 0xc9 is denoted here | and ">". For example, the octet often denoted as 0xc9 is denoted | |||
| as <c9>. | here as <c9>. | |||
| The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", | The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", | |||
| "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this | "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this | |||
| document are to be interpreted as described in [RFC2119]. | document are to be interpreted as described in [RFC2119]. | |||
| 2. IRI Syntax | 2. IRI Syntax | |||
| This section defines the syntax of Internationalized Resource | This section defines the syntax of Internationalized Resource | |||
| Identifiers (IRIs). | Identifiers (IRIs). | |||
| skipping to change at page 7, line 43 | skipping to change at page 8, line 12 | |||
| unreserved characters is extended by adding the characters of the UCS | unreserved characters is extended by adding the characters of the UCS | |||
| (Universal Character Set, [ISO10646]) beyond U+007F, subject to the | (Universal Character Set, [ISO10646]) beyond U+007F, subject to the | |||
| limitations given in the syntax rules below and in Section 6.1. | limitations given in the syntax rules below and in Section 6.1. | |||
| Otherwise, the syntax and use of components and reserved characters | Otherwise, the syntax and use of components and reserved characters | |||
| is the same as that in [RFCYYYY]. All the operations defined in | is the same as that in [RFCYYYY]. All the operations defined in | |||
| [RFCYYYY], such as the resolution of relative URIs, can be applied to | [RFCYYYY], such as the resolution of relative URIs, can be applied to | |||
| IRIs by IRI-processing software in exactly the same way as this is | IRIs by IRI-processing software in exactly the same way as this is | |||
| done to URIs by URI-processing software. | done to URIs by URI-processing software. | |||
| Characters outside the US-ASCII range are not reserved and therefore | Characters outside the US-ASCII repertoire are not reserved and | |||
| MUST NOT be used for syntactical purposes such as to delimit | therefore MUST NOT be used for syntactical purposes such as to | |||
| components in newly defined schemes. As an example, it is not allowed | delimit components in newly defined schemes. As an example, it is | |||
| to use U+00A2, CENT SIGN, as a delimiter in IRIs, because it is in | not allowed to use U+00A2, CENT SIGN, as a delimiter in IRIs, because | |||
| the 'iunreserved' category, in the same way as it is not possible to | it is in the 'iunreserved' category, in the same way as it is not | |||
| use '-' as a delimiter, because it is in the 'unreserved' category in | possible to use '-' as a delimiter, because it is in the 'unreserved' | |||
| URIs. | category in URIs. | |||
| 2.2 ABNF for IRI References and IRIs | 2.2 ABNF for IRI References and IRIs | |||
| While it might be possible to define IRI references and IRIs merely | While it might be possible to define IRI references and IRIs merely | |||
| by their transformation to URI references and URIs, they can also be | by their transformation to URI references and URIs, they can also be | |||
| accepted and processed directly. Therefore, an ABNF definition for | accepted and processed directly. Therefore, an ABNF definition for | |||
| IRI references (which are the most general concept and the start of | IRI references (which are the most general concept and the start of | |||
| the grammar) and IRIs is given here. The syntax of this ABNF is | the grammar) and IRIs is given here. The syntax of this ABNF is | |||
| described in [RFC2234]. Character numbers are taken from the UCS, | described in [RFC2234]. Character numbers are taken from the UCS, | |||
| without implying any actual binary encoding. Terminals in the ABNF | without implying any actual binary encoding. Terminals in the ABNF | |||
| are characters, not bytes. | are characters, not bytes. | |||
| | The following grammar closely follows the URI grammar in [RFCYYYY], | |||
| | except that the range of unreserved characters is expanded to include | |||
| | UCS characters, with the restriction that private UCS characters can | |||
| | occur only in query parts and not elsewhere. The grammar is split | |||
| | into two parts, rules that differ from [RFCYYYY] because of the | |||
| | above-mentioned expansion, and rules that are the same as in | |||
| | [RFCYYYY]. For rules that are different than in [RFCYYYY], the names | |||
| | of the non-terminals have been changed as follows: If the | |||
| | non-terminal contains 'URI', this has been changed to 'IRI'. | |||
| | Otherwise, an 'i' has been prefixed. | |||
| The following rules are different from [RFCYYYY]: | The following rules are different from [RFCYYYY]: | |||
| IRI = scheme ":" ihier-part [ "?" iquery ] | IRI = scheme ":" ihier-part [ "?" iquery ] | |||
| [ "#" ifragment ] | [ "#" ifragment ] | |||
| ihier-part = "//" iauthority ipath-abempty | ihier-part = "//" iauthority ipath-abempty | |||
| / ipath-abs | / ipath-abs | |||
| / ipath-rootless | / ipath-rootless | |||
| / ipath-empty | / ipath-empty | |||
| IRI-reference = IRI / relative-IRI | IRI-reference = IRI / relative-IRI | |||
| absolute-IRI = scheme ":" ihier-part [ "?" iquery ] | absolute-IRI = scheme ":" ihier-part [ "?" iquery ] | |||
| relative-IRI = irelative-part [ "?" iquery ] [ "#" ifragment ] | relative-IRI = irelative-part [ "?" iquery ] [ "#" ifragment ] | |||
| irelative-part = "//" iauthority ipath-abempty | irelative-part = "//" iauthority ipath-abempty | |||
| / ipath-abs | / ipath-abs | |||
| / ipath-noscheme | / ipath-noscheme | |||
| / ipath-empty | / ipath-empty | |||
| skipping to change at page 11, line 11 | skipping to change at page 11, line 39 | |||
| Scheme-specific restrictions are applied to IRIs by converting | Scheme-specific restrictions are applied to IRIs by converting | |||
| IRIs to URIs and checking the URIs against the scheme-specific | IRIs to URIs and checking the URIs against the scheme-specific | |||
| restrictions. | restrictions. | |||
| b) Interpretational: URIs identify resources in various ways. IRIs | b) Interpretational: URIs identify resources in various ways. IRIs | |||
| also identify resources. When the IRI is used solely for | also identify resources. When the IRI is used solely for | |||
| identification purposes, it is not necessary to map the IRI to a | identification purposes, it is not necessary to map the IRI to a | |||
| URI (see Section 5). However, when an IRI is used for resource | URI (see Section 5). However, when an IRI is used for resource | |||
| retrieval, the resource that the IRI locates is the same as the | retrieval, the resource that the IRI locates is the same as the | |||
| one located by the URI obtained after converting the IRI according | one located by the URI obtained after converting the IRI according | |||
| to the procedure defined here. This means that there is no need to | to the procedure defined here. This means that there is no need | |||
| define resolution separately on the IRI level. | to define resolution separately on the IRI level. | |||
| Applications MUST map IRIs to URIs using the following two steps. | Applications MUST map IRIs to URIs using the following two steps. | |||
| Step 1) This step generates a UCS-based character encoding from the | Step 1) This step generates a UCS character sequence from the | |||
| original IRI format. This step has three variants, depending on | original IRI format. This step has three variants, depending on | |||
| the form of the input. | the form of the input. | |||
| Variant A) If the IRI is written on paper or read out loud, or | Variant A) If the IRI is written on paper or read out loud, or | |||
| otherwise represented as a sequence of characters independent | otherwise represented as a sequence of characters independent | |||
| of any character encoding: Represent the IRI as a sequence of | of any character encoding: Represent the IRI as a sequence of | |||
| characters from the UCS normalized according to Normalization | characters from the UCS normalized according to Normalization | |||
| Form C (NFC, [UTR15]). | Form C (NFC, [UTR15]). | |||
| Variant B) If the IRI is in some digital representation (e.g. an | Variant B) If the IRI is in some digital representation (e.g. an | |||
| octet stream) in some known non-Unicode character encoding: | octet stream) in some known non-Unicode character encoding: | |||
| Convert the IRI to a sequence of characters from the UCS | Convert the IRI to a sequence of characters from the UCS | |||
| normalized according to NFC. | normalized according to NFC. | |||
| Variant C) If the IRI is in a Unicode-based character encoding | Variant C) If the IRI is in a Unicode-based character encoding | |||
| (for example UTF-8 or UTF-16): Do not normalize. Move directly | (for example UTF-8 or UTF-16): Do not normalize. Apply Step 2 | |||
| to Step 2. | directly to the encoded Unicode character sequence. | |||
| Step 2) For each character that is disallowed in URI references, | Step 2) For each character in 'ucschar' or 'iprivate', apply Steps | |||
| apply Steps 2.1 through 2.3 below. The disallowed characters | 2.1 through 2.3 below. | |||
| consist of all non-ASCII characters allowed in IRIs. | ||||
| 2.1) Convert the character to a sequence of one or more octets | 2.1) Convert the character to a sequence of one or more octets | |||
| using UTF-8 [RFC3629]. | using UTF-8 [RFC3629]. | |||
| 2.2) Convert each octet to %HH, where HH is the hexadecimal | 2.2) Convert each octet to %HH, where HH is the hexadecimal | |||
| notation of the octet value. Note: This is identical to the | notation of the octet value. Note that this is identical to | |||
| percent-encoding mechanism in Section 2.1 of [RFCYYYY]. To | the percent-encoding mechanism in Section 2.1 of [RFCYYYY]. To | |||
| reduce variability, the hexadecimal notation SHOULD use upper | reduce variability, the hexadecimal notation SHOULD use upper | |||
| case letters. | case letters. | |||
| 2.3) Replace the original character by the resulting character | 2.3) Replace the original character by the resulting character | |||
| sequence (i.e. a sequence of %HH triplets). | sequence (i.e. a sequence of %HH triplets). | |||
| The above mapping from IRIs to URIs produces URIs fully conforming to | The above mapping from IRIs to URIs produces URIs fully conforming to | |||
| [RFCYYYY]. The mapping is also an identity transformation for URIs | [RFCYYYY]. The mapping is also an identity transformation for URIs | |||
| and is idempotent -- applying the mapping a second time will not | and is idempotent -- applying the mapping a second time will not | |||
| change anything. Every URI is by definition an IRI. | change anything. Every URI is by definition an IRI. | |||
| Infrastructure accepting IRIs MAY convert the ireg-name component of | Infrastructure accepting IRIs MAY convert the ireg-name component of | |||
| an IRI as follows (before Step 2.2 above) for schemes that are known | an IRI as follows (before Step 2 above) for schemes that are known to | |||
| to use domain names in ireg-name, but where the scheme definition | use domain names in ireg-name, but where the scheme definition does | |||
| does not allow percent-encoding for ireg-name: Replace the ireg-name | not allow percent-encoding for ireg-name: Replace the ireg-name part | |||
| part of the IRI by the part converted using the ToASCII operation | of the IRI by the part converted using the ToASCII operation | |||
| specified in Section 4.1 of [RFC3490] on each dot-separated label, | specified in Section 4.1 of [RFC3490] on each dot-separated label, | |||
| and using U+002E (FULL STOP) as a label separator, with the flag | and using U+002E (FULL STOP) as a label separator, with the flag | |||
| UseSTD3ASCIIRules set to TRUE and the flag AllowUnassigned set to | UseSTD3ASCIIRules set to TRUE and the flag AllowUnassigned set to | |||
| FALSE for creating IRIs and set to TRUE otherwise. The ToASCII | FALSE for creating IRIs and set to TRUE otherwise. The ToASCII | |||
| operation may fail, but this would mean that the IRI cannot be | operation may fail, but this would mean that the IRI cannot be | |||
| resolved. This conversion SHOULD be used when the goal is to maximize | resolved. This conversion SHOULD be used when the goal is to | |||
| interoperability with legacy URI resolvers. For example, the IRI | maximize interoperability with legacy URI resolvers. For example, | |||
| the IRI | ||||
| http://résumé.example.org may be converted to | http://résumé.example.org may be converted to | |||
| http://xn--rsum-bpad.example.org instead of | http://xn--rsum-bpad.example.org instead of | |||
| http://r%C3%A9sum%C3%A9.example.org. | http://r%C3%A9sum%C3%A9.example.org. | |||
| An IRI with a scheme that is known to use domain names in ireg-name, | An IRI with a scheme that is known to use domain names in ireg-name, | |||
| but where the scheme definition does not allow percent-encoding for | but where the scheme definition does not allow percent-encoding for | |||
| ireg-name, meets scheme-specific restrictions if either the | ireg-name, meets scheme-specific restrictions if either the | |||
| straightforward conversion or the conversion using the ToASCII | straightforward conversion or the conversion using the ToASCII | |||
| operation on ireg-name results in a URI that meets the | operation on ireg-name results in a URI that meets the | |||
| scheme-specific restrictions. An IRI with a scheme that is known to | scheme-specific restrictions. An IRI with a scheme that is known to | |||
| use domain names in ireg-name, but where the scheme definition does | use domain names in ireg-name, but where the scheme definition does | |||
| not allow percent-encoding for ireg-name, resolves to the URI | not allow percent-encoding for ireg-name, resolves to the URI | |||
| obtained after converting the IRI including using the ToASCII | obtained after converting the IRI including using the ToASCII | |||
| operation on ireg-name. Implementations do not need to do this | operation on ireg-name. Implementations do not need to do this | |||
| conversion as long as they produce the same result. | conversion as long as they produce the same result. | |||
| Note: The uniform treatment of the whole IRI in Step 2.2 above is | Note: The difference between Variants B and C in Step 1 (Variant B | |||
| using normalization with NFC while Variant C not using any | ||||
| normalization) is to account for the fact that in many non-Unicode | ||||
| character encodings, some text cannot be represented directly. | ||||
| For example, Vietnam is natively written "Việt Nam" | ||||
| (containing a LATIN SMALL LETTER E WITH CIRCUMFLEX AND DOT BELOW) | ||||
| in NFC, but a direct transcoding from the windows-1258 character | ||||
| encoding leads to "Việt Nam" (containing a LATIN SMALL | ||||
| LETTER E WITH CIRCUMFLEX followed by a COMBINING DOT BELOW), | ||||
| whereas direct transcoding of other 8-bit encodings of Vietnamese | ||||
| may lead to other representations. | ||||
| Note: The uniform treatment of the whole IRI in Step 2 above is | ||||
| important to not make processing dependent on URI scheme. See | important to not make processing dependent on URI scheme. See | |||
| [Gettys] for an in-depth discussion. | [Gettys] for an in-depth discussion. | |||
| Note: In practice, the difference above will not be noticed if | Note: In practice, the difference above will not be noticed if | |||
| mapping from IRI to URI and resolution is tightly integrated (e.g. | mapping from IRI to URI and resolution is tightly integrated (e.g. | |||
| carried out in the same user agent). But conversion using | carried out in the same user agent). But conversion using | |||
| [RFC3490] may be able to better deal with backwards compatibility | [RFC3490] may be able to better deal with backwards compatibility | |||
| issues in case mapping and resolution are separated, as in the | issues in case mapping and resolution are separated, as in the | |||
| case of using an HTTP proxy. | case of using an HTTP proxy. | |||
| Note: Internationalized Domain Names may be contained in parts of an | Note: Internationalized Domain Names may be contained in parts of an | |||
| IRI other than the ireg-name part. It is the responsibility of | IRI other than the ireg-name part. It is the responsibility of | |||
| scheme-specific implementations (if the Internationalized Domain | scheme-specific implementations (if the Internationalized Domain | |||
| Name is part of the scheme syntax) or of server-side | Name is part of the scheme syntax) or of server-side | |||
| implementations (if the Internationalized Domain Name is part of | implementations (if the Internationalized Domain Name is part of | |||
| 'iquery') to apply the necessary conversions at the appropriate | 'iquery') to apply the necessary conversions at the appropriate | |||
| point. Example: Trying to validate the Web page at | point. Example: Trying to validate the Web page at | |||
| http://résumé.example.org would lead to an IRI of | http://résumé.example.org would lead to an IRI of | |||
| http://validator.w3.org/ | http://validator.w3.org/check?uri=http%3A%2F%2Frésumé.example.org, | |||
| check?uri=http%3A%2F%2Frésumé.example.org, which would | which would convert to a URI of | |||
| convert to a URI of | http://validator.w3.org/check?uri=http%3A%2F%2Fr%C3%A9sum%C3%A9.example.org. | |||
| http://validator.w3.org/ | The server side implementation would be responsible to do the | |||
| check?uri=http%3A%2F%2Fr%C3%A9sum%C3%A9.example.org. The server | necessary conversions in order to be able to retrieve the Web | |||
| side implementation would be responsible to do the necessary | page. | |||
| conversions in order to be able to retrieve the Web page. | ||||
| Infrastructure accepting IRIs MAY also deal with the printable | Infrastructure accepting IRIs MAY also deal with the printable | |||
| characters in US-ASCII that are not allowed in URIs, namely "<", ">", | characters in US-ASCII that are not allowed in URIs, namely "<", ">", | |||
| '"', Space, "{", "}", "|", "\", "^", and "`", in Step 2.2 above. If | '"', Space, "{", "}", "|", "\", "^", and "`", in Step 2 above. If | |||
| such characters are found but are not converted, then the conversion | such characters are found but are not converted, then the conversion | |||
| SHOULD fail. Please note that the number sign ("#"), the percent sign | SHOULD fail. Please note that the number sign ("#"), the percent | |||
| ("%"), and the square bracket characters ("[", "]") are not part of | sign ("%"), and the square bracket characters ("[", "]") are not part | |||
| the above list, and MUST NOT be converted. Protocols and formats that | of the above list, and MUST NOT be converted. Protocols and formats | |||
| have used earlier definitions of IRIs including these characters MAY | that have used earlier definitions of IRIs including these characters | |||
| require percent-encoding of these characters as a preprocessing step | MAY require percent-encoding of these characters as a preprocessing | |||
| to extract the actual IRI from a given field. Such preprocessing MAY | step to extract the actual IRI from a given field. Such | |||
| also be used by applications allowing the user to enter an IRI. | preprocessing MAY also be used by applications allowing the user to | |||
| enter an IRI. | ||||
| Note: In this process (in Step 2.3), characters allowed in URI | Note: In this process (in Step 2.3), characters allowed in URI | |||
| references as well as existing percent-encoded sequences are not | references as well as existing percent-encoded sequences are not | |||
| encoded further. (This mapping is similar to, but different from, | encoded further. (This mapping is similar to, but different from, | |||
| the encoding applied when including arbitrary content into some | the encoding applied when including arbitrary content into some | |||
| part of a URI.) For example, an IRI of | part of a URI.) For example, an IRI of | |||
| http://www.example.org/red%09rosé#red (in XML notation) is | http://www.example.org/red%09rosé#red (in XML notation) is | |||
| converted to | converted to | |||
| http://www.example.org/red%09ros%C3%A9#red, not to something like | http://www.example.org/red%09ros%C3%A9#red, not to something like | |||
| http%3A%2F%2Fwww.example.org%2Fred%2509ros%C3%A9%23red. | http%3A%2F%2Fwww.example.org%2Fred%2509ros%C3%A9%23red. | |||
| skipping to change at page 14, line 5 | skipping to change at page 14, line 44 | |||
| conversion to a URI is: | conversion to a URI is: | |||
| http://example.com/%F0%90%8C%80%F0%90%8C%81%F0%90%8C%82 | http://example.com/%F0%90%8C%80%F0%90%8C%81%F0%90%8C%82 | |||
| 3.2 Converting URIs to IRIs | 3.2 Converting URIs to IRIs | |||
| In some situations, it may be desirable to try to convert a URI into | In some situations, it may be desirable to try to convert a URI into | |||
| an equivalent IRI. This section gives a procedure to do such a | an equivalent IRI. This section gives a procedure to do such a | |||
| conversion. The conversion described in this section will always | conversion. The conversion described in this section will always | |||
| result in an IRI which maps back to the URI that was used as an input | result in an IRI which maps back to the URI that was used as an input | |||
| for the conversion (except for potential case differences in | for the conversion (except for potential case differences in | |||
| percent-encoding). However, the IRI resulting from this conversion | percent-encoding and for potential percent-encoded unreserved | |||
| may not be exactly the same as the original IRI (if there ever was | characters). However, the IRI resulting from this conversion may not | |||
| one). | be exactly the same as the original IRI (if there ever was one). | |||
| URI to IRI conversion removes percent-encodings, but not all | URI to IRI conversion removes percent-encodings, but not all | |||
| percent-encodings can be eliminated. There are several reasons for | percent-encodings can be eliminated. There are several reasons for | |||
| this: | this: | |||
| a) Some percent-encodings are necessary to distinguish | a) Some percent-encodings are necessary to distinguish | |||
| percent-encoded and unencoded uses of reserved characters. | percent-encoded and unencoded uses of reserved characters. | |||
| b) Some percent-encodings cannot be interpreted as sequences of UTF-8 | b) Some percent-encodings cannot be interpreted as sequences of UTF-8 | |||
| octets. | octets. | |||
| (Note: The octet patterns of UTF-8 are highly regular. Therefore, | (Note: The octet patterns of UTF-8 are highly regular. Therefore, | |||
| there is a very high probability, but no guarantee, that | there is a very high probability, but no guarantee, that | |||
| percent-encodings that can be interpreted as sequences of UTF-8 | percent-encodings that can be interpreted as sequences of UTF-8 | |||
| octets actually originated from UTF-8. For a detailed discussion, | octets actually originated from UTF-8. For a detailed discussion, | |||
| see [Duerst97].) | see [Duerst97].) | |||
| c) The conversion may result in a character that is not appropriate | c) The conversion may result in a character that is not appropriate | |||
| in an IRI. See Section 6.1 for further details. | in an IRI. See Section 2.2, Section 4.1, and Section 6.1 for | |||
| further details. | ||||
| Conversion from a URI to an IRI is done using the following steps (or | Conversion from a URI to an IRI is done using the following steps (or | |||
| any other algorithm that produces the same result): | any other algorithm that produces the same result): | |||
| 1) Represent the URI as a sequence of octets in US-ASCII. | 1) Represent the URI as a sequence of octets in US-ASCII. | |||
| 2) Convert all percent-encodings (% followed by two hexadecimal | 2) Convert all percent-encodings (% followed by two hexadecimal | |||
| digits) except those corresponding to '%', characters in | digits) except those corresponding to '%', characters in | |||
| 'reserved', and characters in US-ASCII not allowed in URIs, to the | 'reserved', and characters in US-ASCII not allowed in URIs, to the | |||
| corresponding octets. | corresponding octets. | |||
| 3) Re-percent-encode any octet produced in Step 2 that is not part of | 3) Re-percent-encode any octet produced in Step 2 that is not part of | |||
| a strictly legal UTF-8 octet sequence. | a strictly legal UTF-8 octet sequence. | |||
| 4) Re-percent-encode all octets produced in Step 3 that in UTF-8 | 4) Re-percent-encode all octets produced in Step 3 that in UTF-8 | |||
| represent characters that are not appropriate according to Section | represent characters that are not appropriate according to Section | |||
| 4.1 and Section 6.1. | 2.2, Section 4.1, and Section 6.1. | |||
| 5) Interpret the resulting octet sequence as a sequence of characters | 5) Interpret the resulting octet sequence as a sequence of characters | |||
| encoded in UTF-8. | encoded in UTF-8. | |||
| This procedure will convert as many percent-encoded non-ASCII | This procedure will convert as many percent-encoded characters as | |||
| characters as possible to characters in an IRI. Because there are | possible to characters in an IRI. Because there are some choices | |||
| some choices when applying Step 4 (see Section 6.1), results may | when applying Step 4 (see Section 6.1), results may vary. | |||
| vary. | ||||
| Conversions from URIs to IRIs MUST NOT use any other character | Conversions from URIs to IRIs MUST NOT use any other character | |||
| encoding than UTF-8 in Steps 3 and 4 above, even if it might be | encoding than UTF-8 in Steps 3 and 4 above, even if it might be | |||
| possible from context to guess that another character encoding than | possible from context to guess that another character encoding than | |||
| UTF-8 was used in the URI. As an example, the URI http:// | UTF-8 was used in the URI. As an example, the URI | |||
| www.example.org/r%E9sum%E9.html might with some guessing be | http://www.example.org/r%E9sum%E9.html might with some guessing be | |||
| interpreted to contain two e-acute characters encoded as iso-8859-1. | interpreted to contain two e-acute characters encoded as iso-8859-1. | |||
| It must not be converted to an IRI containing these e-acute | It must not be converted to an IRI containing these e-acute | |||
| characters. Otherwise, the IRI will in the future be mapped to http:/ | characters. Otherwise, the IRI will in the future be mapped to | |||
| /www.example.org/r%C3%A9sum%C3%A9.html, which is a different URI than | http://www.example.org/r%C3%A9sum%C3%A9.html, which is a different | |||
| http://www.example.org/r%E9sum%E9.html. | URI than http://www.example.org/r%E9sum%E9.html. | |||
| 3.2.1 Examples | 3.2.1 Examples | |||
| This section shows various examples of converting URIs to IRIs. The | This section shows various examples of converting URIs to IRIs. Each | |||
| notation <hh> is used to denote octets outside those that can be | example shows the result after applying each of the Steps 1 to 5. | |||
| represented in this document. Each example shows the result after | XML Notation is used for the final result. | |||
| applying each of the Steps 1 to 5. XML Notation is used for the final | ||||
| result. | ||||
| The following example contains the sequence '%C3%BC', which is a | The following example contains the sequence '%C3%BC', which is a | |||
| strictly legal UTF-8 sequence, and which is converted into the actual | strictly legal UTF-8 sequence, and which is converted into the actual | |||
| character U+00FC LATIN SMALL LETTER U WITH DIAERESIS (also known as | character U+00FC LATIN SMALL LETTER U WITH DIAERESIS (also known as | |||
| u-umlaut). | u-umlaut). | |||
| 1) http://www.example.org/D%C3%BCrst | 1) http://www.example.org/D%C3%BCrst | |||
| 2) http://www.example.org/D<c3><bc>rst | 2) http://www.example.org/D<c3><bc>rst | |||
| skipping to change at page 16, line 32 | skipping to change at page 17, line 20 | |||
| 2) http://xn--99zt52a.example.org/<e2><80><ae> | 2) http://xn--99zt52a.example.org/<e2><80><ae> | |||
| 3) http://xn--99zt52a.example.org/<e2><80><ae> | 3) http://xn--99zt52a.example.org/<e2><80><ae> | |||
| 4) http://xn--99zt52a.example.org/%E2%80%AE | 4) http://xn--99zt52a.example.org/%E2%80%AE | |||
| 5) http://xn--99zt52a.example.org/%E2%80%AE | 5) http://xn--99zt52a.example.org/%E2%80%AE | |||
| Implementations with scheme-specific knowledge MAY convert | Implementations with scheme-specific knowledge MAY convert | |||
| punycode-encoded domain name labels to the corresponding characters | punycode-encoded domain name labels to the corresponding characters | |||
| using the ToUnicode procedure. Thus, for the example above, the label | using the ToUnicode procedure. Thus, for the example above, the | |||
| xn--99zt52a may be converted to U+7D0D U+8C46 (Japanese Natto), | label xn--99zt52a may be converted to U+7D0D U+8C46 (Japanese Natto), | |||
| leading to the overall IRI of | leading to the overall IRI of | |||
| http://納豆.example.org/%E2%80%AE | http://納豆.example.org/%E2%80%AE | |||
| 4. Bidirectional IRIs for Right-to-left Languages | 4. Bidirectional IRIs for Right-to-left Languages | |||
| Some UCS characters, such as those used in the Arabic and Hebrew | Some UCS characters, such as those used in the Arabic and Hebrew | |||
| script, have an inherent right-to-left (rtl) writing direction. IRIs | script, have an inherent right-to-left (rtl) writing direction. IRIs | |||
| containing such characters (called bidirectional IRIs or Bidi IRIs) | containing such characters (called bidirectional IRIs or Bidi IRIs) | |||
| require additional attention because of the non-trivial relation | require additional attention because of the non-trivial relation | |||
| between logical representation (used for digital representation as | between logical representation (used for digital representation as | |||
| skipping to change at page 17, line 27 | skipping to change at page 18, line 15 | |||
| syntax rules (which includes the rules relevant to their scheme). | syntax rules (which includes the rules relevant to their scheme). | |||
| This assures that bidirectional IRIs can be processed in the same way | This assures that bidirectional IRIs can be processed in the same way | |||
| as other IRIs. | as other IRIs. | |||
| When rendered, bidirectional IRIs MUST be rendered using the Unicode | When rendered, bidirectional IRIs MUST be rendered using the Unicode | |||
| Bidirectional Algorithm [UNIV4], [UNI9]. Bidirectional IRIs MUST be | Bidirectional Algorithm [UNIV4], [UNI9]. Bidirectional IRIs MUST be | |||
| rendered in the same way as they would be rendered if they were in a | rendered in the same way as they would be rendered if they were in a | |||
| left-to-right embedding, i.e. as if they were preceded by U+202A, | left-to-right embedding, i.e. as if they were preceded by U+202A, | |||
| LEFT-TO-RIGHT EMBEDDING (LRE), and followed by U+202C, POP | LEFT-TO-RIGHT EMBEDDING (LRE), and followed by U+202C, POP | |||
| DIRECTIONAL FORMATTING (PDF). Setting the embedding direction can | DIRECTIONAL FORMATTING (PDF). Setting the embedding direction can | |||
| also be done in a higher-order protocol (e.g. the dir='ltr' attribute | also be done in a higher-level protocol (e.g. the dir='ltr' | |||
| in HTML). | attribute in HTML). | |||
| There is no requirement to actually use the above embedding if the | There is no requirement to actually use the above embedding if the | |||
| display is still the same without the embedding. For example, a | display is still the same without the embedding. For example, a | |||
| bidirectional IRI in a text with left-to-right base directionality | bidirectional IRI in a text with left-to-right base directionality | |||
| (such as used for English or Cyrillic) that is preceded and followed | (such as used for English or Cyrillic) that is preceded and followed | |||
| by whitespace and strong left-to-right characters does not need an | by whitespace and strong left-to-right characters does not need an | |||
| embedding. Also, a bidirectional relative IRI that only contains | embedding. Also, a bidirectional relative IRI that only contains | |||
| strong right-to-left characters and weak characters and that starts | strong right-to-left characters and weak characters and that starts | |||
| and ends with a strong right-to-left character and appears in a text | and ends with a strong right-to-left character and appears in a text | |||
| with right-to-left base directionality (such as used for Arabic or | with right-to-left base directionality (such as used for Arabic or | |||
| skipping to change at page 18, line 11 | skipping to change at page 18, line 47 | |||
| The Unicode Bidirectional Algorithm ([UNI9], Section 4.3) permits | The Unicode Bidirectional Algorithm ([UNI9], Section 4.3) permits | |||
| higher-level protocols to influence bidirectional rendering. Such | higher-level protocols to influence bidirectional rendering. Such | |||
| changes by higher-level protocols MUST NOT be used if they change the | changes by higher-level protocols MUST NOT be used if they change the | |||
| rendering of IRIs. | rendering of IRIs. | |||
| The bidirectional formatting characters that may be used before or | The bidirectional formatting characters that may be used before or | |||
| after the IRI to assure correct display are themselves not part of | after the IRI to assure correct display are themselves not part of | |||
| the IRI. IRIs MUST NOT contain bidirectional formatting characters | the IRI. IRIs MUST NOT contain bidirectional formatting characters | |||
| (LRM, RLM, LRE, RLE, LRO, RLO, and PDF). They affect the visual | (LRM, RLM, LRE, RLE, LRO, RLO, and PDF). They affect the visual | |||
| rendering of the IRI, but do not themselves appear visually. It would | rendering of the IRI, but do not themselves appear visually. It | |||
| therefore not be possible to correctly input an IRI with such | would therefore not be possible to correctly input an IRI with such | |||
| characters. | characters. | |||
| 4.2 Bidi IRI Structure | 4.2 Bidi IRI Structure | |||
| The Unicode Bidirectional Algorithm is designed mainly for running | The Unicode Bidirectional Algorithm is designed mainly for running | |||
| text. To make sure that it does not affect the rendering of | text. To make sure that it does not affect the rendering of | |||
| bidirectional IRIs too much, some restrictions on bidirectional IRIs | bidirectional IRIs too much, some restrictions on bidirectional IRIs | |||
| are necessary. These restrictions are given in terms of delimiters | are necessary. These restrictions are given in terms of delimiters | |||
| (structural characters, mostly punctuation such as '@', '.', ':', | (structural characters, mostly punctuation such as '@', '.', ':', | |||
| '/') and components (usually consisting mostly of letters and | '/') and components (usually consisting mostly of letters and | |||
| skipping to change at page 18, line 34 | skipping to change at page 19, line 24 | |||
| The following syntax rules from Section 2.2 correspond to components | The following syntax rules from Section 2.2 correspond to components | |||
| for the purpose of Bidi behavior: iuserinfo, ireg-name, isegment, | for the purpose of Bidi behavior: iuserinfo, ireg-name, isegment, | |||
| isegment-nz, isegment-nzc, ireg-name, iquery, and ifragment. | isegment-nz, isegment-nzc, ireg-name, iquery, and ifragment. | |||
| Specifications that define the syntax of any of the above components | Specifications that define the syntax of any of the above components | |||
| MAY divide them further and define smaller parts to be components | MAY divide them further and define smaller parts to be components | |||
| according to this document. As an example, the restrictions of | according to this document. As an example, the restrictions of | |||
| [RFC3490] on bidirectional domain names correspond to treating each | [RFC3490] on bidirectional domain names correspond to treating each | |||
| label of a domain name as a component for those schemes where | label of a domain name as a component for those schemes where | |||
| ireg-name is a domain name. Even where the components are not defined | ireg-name is a domain name. Even where the components are not | |||
| formally, it may be helpful to think about some syntax in terms of | defined formally, it may be helpful to think about some syntax in | |||
| components and to apply the relevant restrictions. For example, for | terms of components and to apply the relevant restrictions. For | |||
| the usual name/value syntax in query parts, it is convenient to treat | example, for the usual name/value syntax in query parts, it is | |||
| each name and each value as a component. As another example, the | convenient to treat each name and each value as a component. As | |||
| extensions in a resource name can be treated as separate components. | another example, the extensions in a resource name can be treated as | |||
| separate components. | ||||
| For each component, the following restrictions apply: | For each component, the following restrictions apply: | |||
| 1) A component SHOULD NOT use both right-to-left and left-to-right | 1) A component SHOULD NOT use both right-to-left and left-to-right | |||
| characters. | characters. | |||
| 2) A component using right-to-left characters SHOULD start and end | 2) A component using right-to-left characters SHOULD start and end | |||
| with right-to-left characters. | with right-to-left characters. | |||
| The above restrictions are given as shoulds, rather than as musts. | The above restrictions are given as shoulds, rather than as musts. | |||
| skipping to change at page 20, line 14 | skipping to change at page 21, line 4 | |||
| inverted as a whole: | inverted as a whole: | |||
| logical representation: http://ab.CDE.FGH/ij/kl/mn/op.html | logical representation: http://ab.CDE.FGH/ij/kl/mn/op.html | |||
| visual representation: http://ab.HGF.EDC/ij/kl/mn/op.html | visual representation: http://ab.HGF.EDC/ij/kl/mn/op.html | |||
| A sequence of rtl components is read rtl, in the same way as a | A sequence of rtl components is read rtl, in the same way as a | |||
| sequence of rtl words is read rtl in a bidi text. | sequence of rtl words is read rtl in a bidi text. | |||
| Example 3: All components of an IRI (except for the scheme) are rtl. | Example 3: All components of an IRI (except for the scheme) are rtl. | |||
| All rtl components are inverted overall: | All rtl components are inverted overall: | |||
| logical representation: http://AB.CD.EF/GH/IJ/KL?MN=OP;QR=ST#UV | logical representation: http://AB.CD.EF/GH/IJ/KL?MN=OP;QR=ST#UV | |||
| visual representation: http://VU#TS=RQ;PO=NM?LK/JI/HG/FE.DC.BA | visual representation: http://VU#TS=RQ;PO=NM?LK/JI/HG/FE.DC.BA | |||
| The whole IRI (except the scheme) is read rtl. Delimiters between rtl | The whole IRI (except the scheme) is read rtl. Delimiters between | |||
| components stay between the respective components; delimiters between | rtl components stay between the respective components; delimiters | |||
| ltr and rtl components don't move. | between ltr and rtl components don't move. | |||
| Example 4: Several sequences of rtl components are each inverted on | Example 4: Several sequences of rtl components are each inverted on | |||
| their own: | their own: | |||
| logical representation: http://AB.CD.ef/gh/IJ/KL.html | logical representation: http://AB.CD.ef/gh/IJ/KL.html | |||
| visual representation: http://DC.BA.ef/gh/LK/JI.html | visual representation: http://DC.BA.ef/gh/LK/JI.html | |||
| Each sequence of rtl components is read rtl, in the same way as each | Each sequence of rtl components is read rtl, in the same way as each | |||
| sequence of rtl words in an ltr text is read rtl. | sequence of rtl words in an ltr text is read rtl. | |||
| Example 5: Example 2, applied to components of different kinds: | Example 5: Example 2, applied to components of different kinds: | |||
| logical representation: http://ab.cd.EF/GH/ij/kl.html | logical representation: http://ab.cd.EF/GH/ij/kl.html | |||
| skipping to change at page 21, line 27 | skipping to change at page 22, line 17 | |||
| Example 10 (allowed, but not recommended): | Example 10 (allowed, but not recommended): | |||
| logical representation: http://ab.CDEFGH.123/kl/mn/op.html | logical representation: http://ab.CDEFGH.123/kl/mn/op.html | |||
| visual representation: http://ab.123.HGFEDC/kl/mn/op.html | visual representation: http://ab.123.HGFEDC/kl/mn/op.html | |||
| Components consisting of only numbers are allowed (it would be rather | Components consisting of only numbers are allowed (it would be rather | |||
| difficult to prohibit them), but may interact with adjacent RTL | difficult to prohibit them), but may interact with adjacent RTL | |||
| components in ways that are not easy to predict. | components in ways that are not easy to predict. | |||
| 5. IRI Equivalence and Comparison | 5. IRI Equivalence and Comparison | |||
| This section discusses IRI Equivalence and Comparison similar to | This section discusses IRI Equivalence and Comparison similar to | |||
| Section 6, "Normalization and Comparison", in [RFCYYYY]. This section | Section 6, "Normalization and Comparison", in [RFCYYYY]. This | |||
| focuses on the main issues and on aspects that are different from | section focuses on the main issues and on aspects that are different | |||
| [RFCYYYY]; Section 6 of [RFCYYYY] is recommended background reading. | from [RFCYYYY]; Section 6 of [RFCYYYY] is recommended background | |||
| reading. | ||||
| There is no general rule or procedure to decide whether two arbitrary | There is no general rule or procedure to decide whether two arbitrary | |||
| IRIs are equivalent or not (i.e. whether they refer to the same | IRIs are equivalent or not (i.e. whether they refer to the same | |||
| resource or not). Two IRIs that look almost the same may refer to | resource or not). Two IRIs that look almost the same may refer to | |||
| different resources. Two IRIs that look completely different may | different resources. Two IRIs that look completely different may | |||
| refer to the same resource. Each specification or application that | refer to the same resource. Each specification or application that | |||
| uses IRIs has to decide on the appropriate criterion for IRI | uses IRIs has to decide on the appropriate criterion for IRI | |||
| equivalence. | equivalence. | |||
| 5.1 Simple String Comparison | 5.1 Simple String Comparison | |||
| skipping to change at page 21, line 51 | skipping to change at page 22, line 42 | |||
| In some scenarios a definite answer to the question of IRI | In some scenarios a definite answer to the question of IRI | |||
| equivalence is needed that is independent of the scheme used and | equivalence is needed that is independent of the scheme used and | |||
| always can be calculated quickly and without accessing a network. An | always can be calculated quickly and without accessing a network. An | |||
| example of such a case is XML Namespaces ([XMLNamespace]). In such | example of such a case is XML Namespaces ([XMLNamespace]). In such | |||
| cases, two IRIs SHOULD be defined as equivalent if and only if they | cases, two IRIs SHOULD be defined as equivalent if and only if they | |||
| are character-by-character equivalent. This is the same as being | are character-by-character equivalent. This is the same as being | |||
| byte-by-byte equivalent if the character encoding for both IRIs is | byte-by-byte equivalent if the character encoding for both IRIs is | |||
| the same. As an example, | the same. As an example, | |||
| http://example.org/~user, http://example.org/%7euser, and | http://example.org/~user, http://example.org/%7euser, and | |||
| http://example.org/%7Euser are not equivalent under this definition. | http://example.org/%7Euser are not equivalent under this definition. | |||
| In such a case, the comparison function MUST NOT map IRIs to URIs, | When comparing character-by-character, the comparison function MUST | |||
| because such a mapping would create additional spurious equivalences. | NOT map IRIs to URIs, because such a mapping would create additional | |||
| spurious equivalences. | ||||
| It follows that IRIs SHOULD NOT be modified when being transported if | It follows that IRIs SHOULD NOT be modified when being transported if | |||
| there is any chance that this IRI might be used as an identifier in | there is any chance that this IRI might be used as an identifier in | |||
| the way explained above. | the way explained above. When an IRI is used as an identifier in | |||
| scenarios that depend upon character-by-character equivalence, | ||||
| creators of IRIs should take additional care to avoid IRIs that only | ||||
| differ in their use of percent-escaping. As an example, using both | ||||
| http://example.org/~user and http://example.org/%7Euser to identify | ||||
| XML Namespaces is a bad idea. | ||||
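
As an illustration of the rule above, a minimal Python sketch of simple string comparison follows. The function name `iri_equal_simple` is hypothetical. The point is what the function does not do: it applies no percent-decoding, no case mapping, and no IRI-to-URI conversion.

```python
def iri_equal_simple(a: str, b: str) -> bool:
    """Character-by-character comparison with no mapping of any kind."""
    return a == b


# The three spellings from the text are all different under this definition.
spellings = [
    "http://example.org/~user",
    "http://example.org/%7euser",
    "http://example.org/%7Euser",
]
distinct = len(set(spellings))
```
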
| 5.2 Conversion to URIs | 5.2 Conversion to URIs | |||
| For actual resolution, differences in percent-encoding (except for | For actual resolution, differences in percent-encoding (except for | |||
| the percent-encoding of reserved characters) MUST always result in | the percent-encoding of reserved characters) MUST always result in | |||
| the same resource. For example, http://example.org/~user, | the same resource. For example, http://example.org/~user, | |||
| http://example.org/%7euser and http://example.org/%7Euser must | http://example.org/%7euser and http://example.org/%7Euser must | |||
| resolve to the same resource. | resolve to the same resource. | |||
| If this kind of equivalence is to be tested, the percent-encoding of | If this kind of equivalence is to be tested, the percent-encoding of | |||
| both IRIs to be compared has to be aligned, for example by converting | both IRIs to be compared has to be aligned, for example by converting | |||
| both IRIs to URIs (see Section 3.1) and making sure that the case of | both IRIs to URIs (see Section 3.1), eliminating escape differences | |||
| the hexadecimal characters in the percent-encode is always the same | in the resulting URIs, and making sure that the case of the | |||
| (preferably upper case). For comparison, such conversions MUST only | hexadecimal characters in the percent-encoding is always the same | |||
| be done on the fly, while retaining the original IRI. | (preferably upper case). If the IRI is to be passed to another | |||
| application, or used further in some other way, its original form | ||||
| MUST be preserved; the conversion described here should be performed | ||||
| only for the purpose of local comparison. | ||||
| Additional, similar equivalences are possible based on knowledge | Additional, similar equivalences are possible based on knowledge | |||
| about the generic URI/IRI syntax, such as the fact that the scheme | about the generic URI/IRI syntax, such as the fact that the scheme | |||
| part is case-insensitive. | part is case-insensitive. | |||
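
The alignment described in this section could be sketched as follows in Python. This is a simplified illustration, not the procedure of Section 3.1: it percent-encodes every non-ASCII character as UTF-8 and ignores any special handling of the ireg-name part. The unreserved set is assumed to be letters, digits, and "-", ".", "_", "~"; the names `iri_to_uri` and `comparison_key` are hypothetical. The key is computed on the fly, and the original IRI is left untouched.

```python
import re
import string

# Assumption: characters whose percent-encoding can be removed safely.
UNRESERVED = set(string.ascii_letters + string.digits + "-._~")

_PCT = re.compile(r"%([0-9A-Fa-f]{2})")


def iri_to_uri(iri: str) -> str:
    """Percent-encode non-ASCII characters as UTF-8 octets (simplified)."""
    return "".join(
        ch if ord(ch) < 128
        else "".join(f"%{octet:02X}" for octet in ch.encode("utf-8"))
        for ch in iri
    )


def _align(match: re.Match) -> str:
    ch = chr(int(match.group(1), 16))
    if ch in UNRESERVED:
        return ch                          # eliminate the escape difference
    return "%" + match.group(1).upper()    # keep escape, upper-case the hex


def comparison_key(iri: str) -> str:
    """Key for local comparison only; never replaces the original IRI."""
    return _PCT.sub(_align, iri_to_uri(iri))
```

Under these assumptions the three example spellings of the "~user" IRI yield the same key, while percent-encoded reserved characters such as %2F stay encoded.
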
| 5.3 Normalization | 5.3 Normalization | |||
| The Unicode Standard [UNIV4] defines various equivalences between | The Unicode Standard [UNIV4] defines various equivalences between | |||
| sequences of characters for various purposes. Unicode Standard Annex | sequences of characters for various purposes. Unicode Standard Annex | |||
| #15 [UTR15] defines various Normalization Forms for these | #15 [UTR15] defines various Normalization Forms for these | |||
| skipping to change at page 22, line 51 | skipping to change at page 23, line 51 | |||
| comparing two IRIs. The exceptions are conversion from a non-digital | comparing two IRIs. The exceptions are conversion from a non-digital | |||
| form, and conversion from a non-UCS-based character encoding to a | form, and conversion from a non-UCS-based character encoding to a | |||
| UCS-based character encoding. In these cases, NFC or a normalizing | UCS-based character encoding. In these cases, NFC or a normalizing | |||
| transcoder using NFC MUST be used for interoperability. To avoid | transcoder using NFC MUST be used for interoperability. To avoid | |||
| false negatives and problems with transcoding, IRIs SHOULD be created | false negatives and problems with transcoding, IRIs SHOULD be created | |||
| using NFC. Using NFKC may avoid even more problems, for example by | using NFC. Using NFKC may avoid even more problems, for example by | |||
| choosing half-width Latin letters instead of full-width, and | choosing half-width Latin letters instead of full-width, and | |||
| full-width Katakana instead of half-width. | full-width Katakana instead of half-width. | |||
| As an example, http://www.example.org/r&#xE9;sum&#xE9;.html (in XML | As an example, http://www.example.org/r&#xE9;sum&#xE9;.html (in XML | |||
| Notation) is in NFC. On the other hand, http://www.example.org/ | Notation) is in NFC. On the other hand, | |||
| re&#x301;sume&#x301;.html is not in NFC. The former uses precombined | http://www.example.org/re&#x301;sume&#x301;.html is not in NFC. The | |||
| e-acute characters, the latter uses 'e' characters followed by | former uses precombined e-acute characters, the latter uses 'e' | |||
| combining acute accents. Both usages are defined to be canonically | characters followed by combining acute accents. Both usages are | |||
| equivalent in [UNIV4]. | defined to be canonically equivalent in [UNIV4]. | |||
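
The difference between the two forms can be observed with Python's standard `unicodedata` module. This is a small sketch; the helper name `is_nfc` is hypothetical. U+00E9 is the precombined e-acute and U+0301 is the combining acute accent.

```python
import unicodedata

precomposed = "http://www.example.org/r\u00E9sum\u00E9.html"
decomposed = "http://www.example.org/re\u0301sume\u0301.html"


def is_nfc(text: str) -> bool:
    """True if the text is unchanged by normalization to NFC."""
    return unicodedata.normalize("NFC", text) == text


# The two strings differ character by character, yet normalizing the
# decomposed form to NFC yields the precomposed form.
same_after_nfc = unicodedata.normalize("NFC", decomposed) == precomposed
```

The check is meant for the creator of an IRI; as the note that follows explains, third parties should not normalize an IRI they merely transport.
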
| Note: Because it is unknown how a particular field is being treated | Note: Because it is unknown how a particular field is being treated | |||
| with respect to text normalization, it would be inappropriate to | with respect to text normalization, it would be inappropriate to | |||
| allow third parties to normalize an IRI arbitrarily. This does not | allow third parties to normalize an IRI arbitrarily. This does | |||
| contradict the recommendation that when a resource is created, its | not contradict the recommendation that when a resource is created, | |||
| IRI should be as normalized as possible (i.e. NFC or even NFKC). | its IRI should be as normalized as possible (i.e. NFC or even | |||
| This is similar to the upper-case/lower-case problems in URIs. | NFKC). This is similar to the upper-case/lower-case problems in | |||
| Some parts of a URI are case-insensitive (domain name). For | URIs. Some parts of a URI are case-insensitive (domain name). | |||
| others, it is unclear whether they are case-sensitive or | For others, it is unclear whether they are case-sensitive or | |||
| case-insensitive, or something in between (e.g. case-sensitive, | case-insensitive, or something in between (e.g. case-sensitive, | |||
| but if the wrong case is used, a multiple choice selection is | but if the wrong case is used, a multiple choice selection is | |||
| provided instead of a direct negative result). The best recipe is | provided instead of a direct negative result). The best recipe is | |||
| that the creator uses a reasonable capitalization, and when | that the creator uses a reasonable capitalization, and when | |||
| transferring the URI, that capitalization is never changed. | transferring the URI, that capitalization is never changed. | |||
| Various IRI schemes may allow the usage of International Domain Names | Various IRI schemes may allow the usage of International Domain Names | |||
| (IDN) [RFC3490]. When in use in IRIs, those names SHOULD be validated | (IDN) [RFC3490]. When in use in IRIs, those names SHOULD be | |||
| using the ToASCII operation defined in [RFC3490], with the flags | validated using the ToASCII operation defined in [RFC3490], with the | |||
| "UseSTD3ASCIIRules" and "AllowUnassigned". An IRI containing an | flags "UseSTD3ASCIIRules" and "AllowUnassigned". An IRI containing | |||
| invalid IDN cannot successfully be resolved. For legibility purposes, | an invalid IDN cannot successfully be resolved. For legibility | |||
| IDN components of IRIs SHOULD NOT be converted into ASCII Compatible | purposes, IDN components of IRIs SHOULD NOT be converted into ASCII | |||
| Encoding (ACE). | Compatible Encoding (ACE). | |||
| 5.4 Preferred Forms | 5.4 Preferred Forms | |||
| The following are the preferred forms for IRIs when created: | The following are the preferred forms for IRIs when created: | |||
| - Always provide the URI scheme in lowercase characters. | - Always provide the URI scheme in lowercase characters. | |||
| - Only perform percent-encoding where it is essential. | - Only perform percent-encoding where it is essential. | |||
| - Always use uppercase A-through-F characters when percent-encoding. | - Always use uppercase A-through-F characters when percent-encoding. | |||
| skipping to change at page 24, line 13 | skipping to change at page 25, line 13 | |||
| - Prevent /./ and /../ from appearing in non-relative URI paths. | - Prevent /./ and /../ from appearing in non-relative URI paths. | |||
| - For schemes that define an empty path to be equivalent to a path | - For schemes that define an empty path to be equivalent to a path | |||
| of "/", use "/". | of "/", use "/". | |||
| 6. Use of IRIs | 6. Use of IRIs | |||
| 6.1 Limitations on UCS Characters Allowed in IRIs | 6.1 Limitations on UCS Characters Allowed in IRIs | |||
| This section discusses limitations on characters and character | This section discusses limitations on characters and character | |||
| sequences usable for IRIs. The considerations in this section are | sequences usable for IRIs beyond those given in Section 2.2 and | |||
| relevant when creating IRIs and when converting from URIs to IRIs. | Section 4.1. The considerations in this section are relevant when | |||
| creating IRIs and when converting from URIs to IRIs. | ||||
| a) The repertoire of characters allowed in each IRI component is | a) The repertoire of characters allowed in each IRI component is | |||
| limited by the definition of that component. For example, the | limited by the definition of that component. For example, the | |||
| definition of the scheme component does not allow characters | definition of the scheme component does not allow characters | |||
| beyond US-ASCII. | beyond US-ASCII. | |||
| (Note: In accordance with URI practice, generic IRI software | (Note: In accordance with URI practice, generic IRI software | |||
| cannot and should not check for such limitations.) | cannot and should not check for such limitations.) | |||
| b) The UCS contains many areas of characters for which there are | b) The UCS contains many areas of characters for which there are | |||
| strong visual look-alikes. Because of the likelihood of | strong visual look-alikes. Because of the likelihood of | |||
| transcription errors, these also should be avoided. This includes | transcription errors, these also should be avoided. This includes | |||
| the full-width equivalents of ASCII characters, half-width | the full-width equivalents of Latin characters, half-width | |||
| Katakana characters for Japanese, and many others. This also | Katakana characters for Japanese, and many others. This also | |||
| includes many look-alikes of "space", "delims", and "unwise", | includes many look-alikes of "space", "delims", and "unwise", | |||
| characters excluded in [RFC3491]. | characters excluded in [RFC3491]. | |||
| Additional information is available from [UNIXML]. [UNIXML] is | Additional information is available from [UNIXML]. [UNIXML] is | |||
| written in the context of running text rather than in the context of | written in the context of running text rather than in the context of | |||
| identifiers. Nevertheless, it discusses many of the categories of | identifiers. Nevertheless, it discusses many of the categories of | |||
| characters not appropriate for IRIs. | characters not appropriate for IRIs. | |||
| 6.2 Software Interfaces and Protocols | 6.2 Software Interfaces and Protocols | |||
| skipping to change at page 25, line 35 | skipping to change at page 26, line 35 | |||
| formats and protocols will be required to handle IRIs [CharMod]. | formats and protocols will be required to handle IRIs [CharMod]. | |||
| 6.4 Use of UTF-8 for Encoding Original Characters | 6.4 Use of UTF-8 for Encoding Original Characters | |||
| This section discusses details and gives examples for point c) in | This section discusses details and gives examples for point c) in | |||
| Section 1.2. In order to be able to use IRIs, the URI corresponding | Section 1.2. In order to be able to use IRIs, the URI corresponding | |||
| to the IRI in question has to encode original characters into octets | to the IRI in question has to encode original characters into octets | |||
| using UTF-8. This can be specified for all URIs of a URI scheme, or | using UTF-8. This can be specified for all URIs of a URI scheme, or | |||
| can apply to individual URIs for schemes that do not specify how to | can apply to individual URIs for schemes that do not specify how to | |||
| encode original characters. It can apply to the whole URI, or only | encode original characters. It can apply to the whole URI, or only | |||
| some part. | some part. For background information on encoding characters into | |||
| URIs, see also Section 2.5 of [RFCYYYY]. | ||||
| For new URI schemes, using UTF-8 is recommended in [RFC2718]. | For new URI schemes, using UTF-8 is recommended in [RFC2718]. | |||
| Examples where this is already used are the URN syntax [RFC2141], | Examples where this is already used are the URN syntax [RFC2141], | |||
| IMAP URLs [RFC2192], and POP URLs [RFC2384]. On the other hand, | IMAP URLs [RFC2192], and POP URLs [RFC2384]. On the other hand, | |||
| because the HTTP URL scheme does not specify how to encode original | because the HTTP URL scheme does not specify how to encode original | |||
| characters, only some HTTP URLs can have corresponding but different | characters, only some HTTP URLs can have corresponding but different | |||
| IRIs. | IRIs. | |||
| For example, for a document with a URI of | For example, for a document with a URI of | |||
| http://www.example.org/r%C3%A9sum%C3%A9.html, it is possible to | http://www.example.org/r%C3%A9sum%C3%A9.html, it is possible to | |||
| construct a corresponding IRI (in XML notation, see Section 1.4): | construct a corresponding IRI (in XML notation, see Section 1.4): | |||
| http://www.example.org/r&#xE9;sum&#xE9;.html (&#xE9; stands for the | http://www.example.org/r&#xE9;sum&#xE9;.html (&#xE9; stands for the | |||
| e-acute character, and %C3%A9 is the UTF-8 encoded and | e-acute character, and %C3%A9 is the UTF-8 encoded and | |||
| percent-encoded representation of that character). On the other hand, | percent-encoded representation of that character). On the other | |||
| for a document with a URI of http://www.example.org/r%E9sum%E9.html, | hand, for a document with a URI of | |||
| the percent-encoding octets cannot be converted to actual characters | http://www.example.org/r%E9sum%E9.html, the percent-encoding octets | |||
| in an IRI, because the percent-encoding is not based on UTF-8. | cannot be converted to actual characters in an IRI, because the | |||
| percent-encoding is not based on UTF-8. | ||||
| The requirement for the use of UTF-8 applies to all parts of a URI | The requirement for the use of UTF-8 applies to all parts of a URI | |||
| (with the potential exception of the ireg-name part, see Section | (with the potential exception of the ireg-name part, see Section | |||
| 3.1). However, it is possible that the capability of IRIs to | 3.1). However, it is possible that the capability of IRIs to | |||
| represent a wide range of characters directly is used just in some | represent a wide range of characters directly is used just in some | |||
| parts of the IRI (or IRI reference). The other parts of the IRI may | parts of the IRI (or IRI reference). The other parts of the IRI may | |||
| only contain ASCII characters, or they may not be based on UTF-8. | only contain US-ASCII characters, or they may not be based on UTF-8. | |||
| They may be based on another character encoding, or they may directly | They may be based on another character encoding, or they may directly | |||
| encode raw binary data (see also [RFC2397]). | encode raw binary data (see also [RFC2397]). | |||
| For example, it is possible to have a URI reference of | For example, it is possible to have a URI reference of | |||
| http://www.example.org/r%E9sum%E9.xml#r%C3%A9sum%C3%A9, where the | http://www.example.org/r%E9sum%E9.xml#r%C3%A9sum%C3%A9, where the | |||
| document name is encoded in iso-8859-1 based on server settings, but | document name is encoded in iso-8859-1 based on server settings, but | |||
| the fragment identifier is encoded in UTF-8 according to [XPointer]. | the fragment identifier is encoded in UTF-8 according to [XPointer]. | |||
| The IRI corresponding to the above URI would be (in XML notation) | The IRI corresponding to the above URI would be (in XML notation) | |||
| http://www.example.org/r%E9sum%E9.xml#r&#xE9;sum&#xE9;. | http://www.example.org/r%E9sum%E9.xml#r&#xE9;sum&#xE9;. | |||
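
The examples in this section can be reproduced with a short Python sketch. It is a simplification of the conversion in Section 3.2, under stated assumptions: each run of percent-encoded octets is decoded only if the whole run is valid UTF-8, ASCII characters stay percent-encoded, and the additional checks on which characters are appropriate in an IRI are omitted. The name `uri_to_iri` is hypothetical.

```python
import re

# One or more consecutive percent-encoded octets.
_PCT_RUN = re.compile(r"(?:%[0-9A-Fa-f]{2})+")


def _decode_run(match: re.Match) -> str:
    run = match.group(0)
    octets = bytes.fromhex(run.replace("%", ""))
    try:
        text = octets.decode("utf-8")      # strict: rejects non-UTF-8 octets
    except UnicodeDecodeError:
        return run                         # not UTF-8: leave the run as-is
    # Keep ASCII characters percent-encoded; emit non-ASCII characters directly.
    return "".join(ch if ord(ch) >= 128 else f"%{ord(ch):02X}" for ch in text)


def uri_to_iri(uri: str) -> str:
    """Convert UTF-8-based percent-encodings to characters (simplified)."""
    return _PCT_RUN.sub(_decode_run, uri)
```

With this sketch, %C3%A9 becomes the e-acute character, whereas %E9 (iso-8859-1) is left untouched, so the mixed URI reference above keeps its document name percent-encoded and only the fragment identifier changes.
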
| skipping to change at page 27, line 10 | skipping to change at page 28, line 14 | |||
| In case the current handling in an API or protocol is based on | In case the current handling in an API or protocol is based on | |||
| US-ASCII, UTF-8 is recommended as the character encoding for IRIs, | US-ASCII, UTF-8 is recommended as the character encoding for IRIs, | |||
| because this is compatible with US-ASCII, is in accordance with the | because this is compatible with US-ASCII, is in accordance with the | |||
| recommendations of [RFC2277], and makes it easy to convert to URIs | recommendations of [RFC2277], and makes it easy to convert to URIs | |||
| where necessary. In any case, the API or protocol definition must | where necessary. In any case, the API or protocol definition must | |||
| clearly define the character encoding to be used. | clearly define the character encoding to be used. | |||
| The transfer from URI-only to IRI-capable components requires no | The transfer from URI-only to IRI-capable components requires no | |||
| mapping, although the conversion described in Section 3.2 above may | mapping, although the conversion described in Section 3.2 above may | |||
| be performed. It is preferable not to perform this inverse conversion | be performed. It is preferable not to perform this inverse | |||
| when there is a chance that this cannot be done correctly. | conversion when there is a chance that this cannot be done correctly. | |||
| 7.2 URI/IRI Entry | 7.2 URI/IRI Entry | |||
| There are components that allow users to enter URIs into the system, | There are components that allow users to enter URIs into the system, | |||
| for example by typing or dictation. This software must be updated to | for example by typing or dictation. This software must be updated to | |||
| allow for IRI entry. | allow for IRI entry. | |||
| A person viewing a visual representation of an IRI (as a sequence of | A person viewing a visual representation of an IRI (as a sequence of | |||
| glyphs, in some order, in some visual display) or hearing an IRI, | glyphs, in some order, in some visual display) or hearing an IRI, | |||
| will use an entry method for characters in the user's language to | will use an entry method for characters in the user's language to | |||
| skipping to change at page 27, line 36 | skipping to change at page 28, line 40 | |||
| restrictions defined in Section 2.2 are met. This may be done by | restrictions defined in Section 2.2 are met. This may be done by | |||
| choosing appropriate input methods or variants/settings thereof, by | choosing appropriate input methods or variants/settings thereof, by | |||
| appropriately converting the characters being input, by eliminating | appropriately converting the characters being input, by eliminating | |||
| characters that cannot be converted, and/or by issuing a warning or | characters that cannot be converted, and/or by issuing a warning or | |||
| error message to the user. | error message to the user. | |||
| As an example of variant settings, input method editors for East | As an example of variant settings, input method editors for East | |||
| Asian Languages usually allow the input of Latin letters and related | Asian Languages usually allow the input of Latin letters and related | |||
| characters in full-width or half-width versions. For IRI input, the | characters in full-width or half-width versions. For IRI input, the | |||
| input method editor should be set so that it produces half-width | input method editor should be set so that it produces half-width | |||
| Latin letters, and full-width Katakana. | Latin letters and punctuation, and full-width Katakana. | |||
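
Where the input method cannot be configured, one way to approximate the same result is to convert the entered characters with NFKC, which Section 5.3 mentions as mapping full-width Latin letters to half-width and half-width Katakana to full-width. The following Python sketch assumes that applying NFKC to user input is acceptable for the application; the function name `fold_width` is hypothetical, and NFKC also changes other compatibility characters.

```python
import unicodedata


def fold_width(entered: str) -> str:
    """Apply NFKC to typed input (folds width variants, among other things)."""
    return unicodedata.normalize("NFKC", entered)


fullwidth_latin = "\uFF45\uFF58"            # full-width "e", "x"
halfwidth_katakana = "\uFF76\uFF80\uFF85"   # half-width ka, ta, na

latin_result = fold_width(fullwidth_latin)
katakana_result = fold_width(halfwidth_katakana)
```

This conversion belongs at entry time, under the user's control; it is not something to apply to IRIs received from elsewhere.
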
| An input field primarily or only used for the input of URIs/IRIs may | An input field primarily or only used for the input of URIs/IRIs may | |||
| allow the user to view an IRI as mapped to a URI. Places where the | allow the user to view an IRI as mapped to a URI. Places where the | |||
| input of IRIs is frequent may provide the possibility for viewing an | input of IRIs is frequent may provide the possibility for viewing an | |||
| IRI as mapped to a URI. This will help users when some of the | IRI as mapped to a URI. This will help users when some of the | |||
| software they use does not yet accept IRIs. | software they use does not yet accept IRIs. | |||
| An IRI input component that interfaces to components that handle | An IRI input component that interfaces to components that handle | |||
| URIs, but not IRIs, must map the IRI to a URI before passing it to | URIs, but not IRIs, must map the IRI to a URI before passing it to | |||
| such a component. | such a component. | |||
| skipping to change at page 29, line 13 | skipping to change at page 30, line 13 | |||
| servers, similar considerations apply, see in particular [RFC2640]. | servers, similar considerations apply, see in particular [RFC2640]. | |||
| 7.5 URI/IRI Selection | 7.5 URI/IRI Selection | |||
| In some cases, resource owners and publishers have control over the | In some cases, resource owners and publishers have control over the | |||
| IRIs used to identify their resources. Such control is mostly | IRIs used to identify their resources. Such control is mostly | |||
| executed by controlling the resource names, such as file names, | executed by controlling the resource names, such as file names, | |||
| directly. | directly. | |||
| In such cases, it is recommended to avoid choosing IRIs that are | In such cases, it is recommended to avoid choosing IRIs that are | |||
| easily confused. For example, for US-ASCII, the lower-case ell "l" is | easily confused. For example, for US-ASCII, the lower-case ell "l" | |||
| easily confused with the digit one "1", and the upper-case oh "O" is | is easily confused with the digit one "1", and the upper-case oh "O" | |||
| easily confused with the digit zero "0". Publishers should avoid | is easily confused with the digit zero "0". Publishers should avoid | |||
| confusing users with "br0ken" or "1ame" identifiers. | confusing users with "br0ken" or "1ame" identifiers. | |||
| Outside of the US-ASCII range, there are many more opportunities for | Outside of the US-ASCII repertoire, there are many more opportunities | |||
| confusion; a complete set of guidelines is too lengthy to include | for confusion; a complete set of guidelines is too lengthy to include | |||
| here. As long as names are limited to characters from a single | here. As long as names are limited to characters from a single | |||
| script, native writers of a given script or language will know best | script, native writers of a given script or language will know best | |||
| when ambiguities can appear, and how they can be avoided. What may | when ambiguities can appear, and how they can be avoided. What may | |||
| look ambiguous to a stranger may be completely obvious to the average | look ambiguous to a stranger may be completely obvious to the average | |||
| native user. On the other hand, in some cases, the UCS contains | native user. On the other hand, in some cases, the UCS contains | |||
| variants for compatibility reasons, for example for typographic | variants for compatibility reasons, for example for typographic | |||
| purposes. These should be avoided wherever possible. Although there | purposes. These should be avoided wherever possible. Although there | |||
| may be exceptions, in general newly created resource names should be | may be exceptions, in general newly created resource names should be | |||
| in NFKC [UTR15] (which means that they are also in NFC). | in NFKC [UTR15] (which means that they are also in NFC). | |||
| skipping to change at page 30, line 28 | skipping to change at page 31, line 28 | |||
| encodings than UTF-8. Such URIs may be produced by user agents that | encodings than UTF-8. Such URIs may be produced by user agents that | |||
| do not conform to this specification and use legacy character | do not conform to this specification and use legacy character | |||
| encodings to convert non-ASCII characters to URIs. Whether this is | encodings to convert non-ASCII characters to URIs. Whether this is | |||
| necessary and what character encodings to cover, depends on a number | necessary and what character encodings to cover, depends on a number | |||
| of factors, such as the legacy character encodings used locally and | of factors, such as the legacy character encodings used locally and | |||
| the distribution of various versions of user agents. For example, | the distribution of various versions of user agents. For example, | |||
| software for Japanese may accept URIs in Shift_JIS and/or EUC-JP in | software for Japanese may accept URIs in Shift_JIS and/or EUC-JP in | |||
| addition to UTF-8. | addition to UTF-8. | |||
| Third, it may include additional mappings to be more user-friendly | Third, it may include additional mappings to be more user-friendly | |||
| and robust against transmission errors. These would be similar to how | and robust against transmission errors. These would be similar to | |||
| currently some servers treat URIs as case-insensitive, or perform | how currently some servers treat URIs as case-insensitive, or perform | |||
| additional matching to account for spelling errors. For characters | additional matching to account for spelling errors. For characters | |||
| beyond the ASCII repertoire, this may for example include ignoring | beyond the US-ASCII repertoire, this may for example include ignoring | |||
| the accents on received IRIs or resource names where appropriate. | the accents on received IRIs or resource names where appropriate. | |||
| Please note that such mappings, including case mappings, are | Please note that such mappings, including case mappings, are | |||
| language-dependent. | language-dependent. | |||
| It can be difficult to unambiguously identify a resource if too many | It can be difficult to unambiguously identify a resource if too many | |||
| mappings are taken into consideration. However, percent-encoded and | mappings are taken into consideration. However, percent-encoded and | |||
| not percent-encoded parts of IRIs can always clearly be | not percent-encoded parts of IRIs can always clearly be | |||
| distinguished. Also, the regularity of UTF-8 (see [Duerst97]) makes | distinguished. Also, the regularity of UTF-8 (see [Duerst97]) makes | |||
| the potential for collisions lower than it may seem at first sight. | the potential for collisions lower than it may seem at first sight. | |||
| skipping to change at page 31, line 21 | skipping to change at page 32, line 21 | |||
| individual IRI, care should be taken to upgrade the corresponding | individual IRI, care should be taken to upgrade the corresponding | |||
| interpreting software in order to cover the forms expected to be | interpreting software in order to cover the forms expected to be | |||
| received by various versions of entry and transport software. | received by various versions of entry and transport software. | |||
| The upgrade of generating software to generate IRIs instead of using | The upgrade of generating software to generate IRIs instead of using | |||
| a local character encoding should happen only after the service is | a local character encoding should happen only after the service is | |||
| upgraded to accept IRIs. Similarly, IRIs should only be generated | upgraded to accept IRIs. Similarly, IRIs should only be generated | |||
| when the service accepts IRIs and the intervening infrastructure and | when the service accepts IRIs and the intervening infrastructure and | |||
| protocol is known to transport them safely. | protocol is known to transport them safely. | |||
| Display software should be upgraded only after upgraded entry | Software converting from URIs to IRIs for display should be upgraded | |||
| software has been widely deployed to the population that will see the | only after upgraded entry software has been widely deployed to the | |||
| displayed result. | population that will see the displayed result. | |||
| It is often possible to reduce the effort and dependencies for | It is often possible to reduce the effort and dependencies for | |||
| upgrading to IRIs by using UTF-8 rather than another character | upgrading to IRIs by using UTF-8 rather than another character | |||
| encoding where there is a free choice of character encodings. For | encoding where there is a free choice of character encodings. For | |||
| example, when setting up a new file-based Web server, using UTF-8 as | example, when setting up a new file-based Web server, using UTF-8 as | |||
| the character encoding for file names will make the transition to | the character encoding for file names will make the transition to | |||
| IRIs easier. Likewise, when setting up a new Web form using UTF-8 as | IRIs easier. Likewise, when setting up a new Web form using UTF-8 as | |||
| the character encoding of the form page, the returned query URIs will | the character encoding of the form page, the returned query URIs will | |||
| use UTF-8 as the character encoding (unless the user, for whatever | use UTF-8 as the character encoding (unless the user, for whatever | |||
| reason, changes the character encoding) and will therefore be | reason, changes the character encoding) and will therefore be | |||
| compatible with IRIs. | compatible with IRIs. | |||
| These recommendations, when taken together, will allow for the | These recommendations, when taken together, will allow for the | |||
| extension from URIs to IRIs in order to handle scripts other than | extension from URIs to IRIs in order to handle characters other than | |||
| ASCII while minimizing interoperability problems. | US-ASCII while minimizing interoperability problems. | |||
| 8. Security Considerations | 8. Security Considerations | |||
| The security considerations discussed in [RFCYYYY] also apply to | The security considerations discussed in [RFCYYYY] also apply to | |||
| IRIs. In addition, the following issues require particular care for | IRIs. In addition, the following issues require particular care for | |||
| IRIs. | IRIs. | |||
| Incorrect encoding or decoding can lead to security problems. In | Incorrect encoding or decoding can lead to security problems. In | |||
| particular, some UTF-8 decoders do not check against overlong byte | particular, some UTF-8 decoders do not check against overlong byte | |||
| sequences. As an example, a '/' is encoded with the byte 0x2F both in | sequences. As an example, a '/' is encoded with the byte 0x2F both | |||
| UTF-8 and in ASCII, but some UTF-8 decoders also wrongly interpret | in UTF-8 and in US-ASCII, but some UTF-8 decoders also wrongly | |||
| the sequence 0xC0 0xAF as a '/'. A sequence such as '%C0%AF..' may | interpret the sequence 0xC0 0xAF as a '/'. A sequence such as | |||
| pass some security tests and then be interpreted as '/..' in a path | '%C0%AF..' may pass some security tests and then be interpreted as '/ | |||
| if UTF-8 decoders are fault-tolerant, if conversion and checking are | ..' in a path if UTF-8 decoders are fault-tolerant, if conversion and | |||
| not done in the right order, and/or if reserved characters and | checking are not done in the right order, and/or if reserved | |||
| unreserved characters are not clearly distinguished. | characters and unreserved characters are not clearly distinguished. | |||
| There are various ways in which "spoofing" can occur with IRIs. | There are various ways in which "spoofing" can occur with IRIs. | |||
| "Spoofing" means that somebody may add a resource name that looks the | "Spoofing" means that somebody may add a resource name that looks the | |||
| same or similar to the user, but points to a different resource. The | same or similar to the user, but points to a different resource. The | |||
| added resource may pretend to be the real resource by looking very | added resource may pretend to be the real resource by looking very | |||
| similar, but may contain all kinds of changes that may be difficult | similar, but may contain all kinds of changes that may be difficult | |||
| to spot and can cause all kinds of problems. Most spoofing | to spot and can cause all kinds of problems. Most spoofing | |||
| possibilities for IRIs are extensions of those for URIs. | possibilities for IRIs are extensions of those for URIs. | |||
| Spoofing can occur for various reasons. A first reason is that | Spoofing can occur for various reasons. A first reason is that | |||
| normalization expectations of a user or actual normalization when | normalization expectations of a user or actual normalization when | |||
| entering an IRI, or when transcoding an IRI from a legacy character | entering an IRI, or when transcoding an IRI from a legacy character | |||
| encoding, do not match the normalization used on the server side. | encoding, do not match the normalization used on the server side. | |||
| Conceptually, this is no different from the problems surrounding the | Conceptually, this is no different from the problems surrounding the | |||
| use of case-insensitive web servers. For example, a popular web page | use of case-insensitive web servers. For example, a popular web page | |||
| with a mixed case name (http://big.site/PopularPage.html) might be | with a mixed case name (http://big.example.com/PopularPage.html) | |||
| "spoofed" by someone who is able to create http://big.site/ | might be "spoofed" by someone who is able to create | |||
| popularpage.html. However, the use of unnormalized character | http://big.example.com/popularpage.html. However, the use of | |||
| sequences, and of additional mappings for user convenience, may | unnormalized character sequences, and of additional mappings for user | |||
| increase the chance for spoofing. Protocols and servers that allow | convenience, may increase the chance for spoofing. Protocols and | |||
| the creation of resources with unnormalized names, and resources with | servers that allow the creation of resources with names that are not | |||
| names that are not normalized, are particularly vulnerable to such | normalized are particularly vulnerable to such attacks. This is an | |||
| attacks. This is an inherent security problem of the relevant | inherent security problem of the relevant protocol, server, or | |||
| protocol, server, or resource, and not specific to IRIs, but | resource, and not specific to IRIs, but mentioned here for | |||
| mentioned here for completeness. | completeness. | |||
| Spoofing can occur in various IRI components, such as the domain name | Spoofing can occur in various IRI components, such as the domain name | |||
| part or a path part. For considerations specific to the domain name | part or a path part. For considerations specific to the domain name | |||
| part, see [RFC3491]. For the path part, administrators of sites which | part, see [RFC3491]. For the path part, administrators of sites | |||
| allow independent users to create resources in the same subarea may | which allow independent users to create resources in the same subarea | |||
| need to be careful to check for spoofing. | may need to be careful to check for spoofing. | |||
| Spoofing can occur because in the UCS, there are many characters that | Spoofing can occur because in the UCS, there are many characters that | |||
| look very similar. Details are discussed in Section 7.5. Again, this | look very similar. Details are discussed in Section 7.5. Again, | |||
| is very similar to spoofing possibilities on US-ASCII, e.g. using | this is very similar to spoofing possibilities on US-ASCII, e.g. | |||
| 'br0ken' or '1ame' URIs. | using 'br0ken' or '1ame' URIs. | |||
| Spoofing can occur when URIs with percent-encodings based on various | Spoofing can occur when URIs with percent-encodings based on various | |||
| character encodings are accepted to deal with older user agents. In | character encodings are accepted to deal with older user agents. In | |||
| some cases, in particular for Latin-based resource names, this is | some cases, in particular for Latin-based resource names, this is | |||
| usually easy to detect because UTF-8-encoded names, when interpreted | usually easy to detect because UTF-8-encoded names, when interpreted | |||
| and viewed as legacy character encodings, produce mostly garbage. In | and viewed as legacy character encodings, produce mostly garbage. In | |||
| other cases, when concurrently used character encodings have a | other cases, when concurrently used character encodings have a | |||
| similar structure, but there are no characters that have exactly the | similar structure, but there are no characters that have exactly the | |||
| same encoding, detection is more difficult. | same encoding, detection is more difficult. | |||
| Spoofing can occur with bidirectional IRIs, if the restrictions in | Spoofing can occur with bidirectional IRIs, if the restrictions in | |||
| Section 4.2 are not followed. The same visual representation may be | Section 4.2 are not followed. The same visual representation may be | |||
| interpreted as different logical representations, and vice versa. It | interpreted as different logical representations, and vice versa. It | |||
| is also very important that a correct Unicode bidirectional | is also very important that a correct Unicode bidirectional | |||
| implementation is used. | implementation is used. | |||
| 9. Acknowledgements | 9. IANA Considerations | |||
| This document has no actions for IANA. | ||||
| 10. Acknowledgements | ||||
| We would like to thank Larry Masinter for his work as coauthor of | We would like to thank Larry Masinter for his work as coauthor of | |||
| many earlier versions of this document (draft-masinter-url-i18n-xx). | many earlier versions of this document (draft-masinter-url-i18n-xx). | |||
| The discussion on the issue addressed here has started a long time | The discussion on the issue addressed here has started a long time | |||
| ago. There was a thread in the HTML working group in August 1995 | ago. There was a thread in the HTML working group in August 1995 | |||
| (under the topic of "Globalizing URIs") and in the www-international | (under the topic of "Globalizing URIs") and in the www-international | |||
| mailing list in July 1996 (under the topic of "Internationalization | mailing list in July 1996 (under the topic of "Internationalization | |||
| and URLs"), and ad-hoc meetings at the Unicode conferences in | and URLs"), and ad-hoc meetings at the Unicode conferences in | |||
| September 1995 and September 1997. | September 1995 and September 1997. | |||
| Many thanks go to Francois Yergeau, Matitiahu Allouche, Roy Fielding, | Many thanks go to Francois Yergeau, Matitiahu Allouche, Roy Fielding, | |||
| Tim Berners-Lee, Mark Davis, M.T. Carrasco Benitez, James Clark, Tim | Tim Berners-Lee, Mark Davis, M.T. Carrasco Benitez, James Clark, Tim | |||
| Bray, Chris Wendt, Yaron Goland, Andrea Vine, Misha Wolf, Leslie | Bray, Chris Wendt, Yaron Goland, Andrea Vine, Misha Wolf, Leslie | |||
| Daigle, Ted Hardie, Makoto MURATA, Steven Atkin, Ryan Stansifer, Tex | Daigle, Ted Hardie, Makoto MURATA, Steven Atkin, Ryan Stansifer, Tex | |||
| Texin, Graham Klyne, Bjoern Hoehrmann, Chris Lilley, Ian Jacobs, Adam | Texin, Graham Klyne, Bjoern Hoehrmann, Chris Lilley, Ian Jacobs, Adam | |||
| Costello, Dan Oscarson, Elliotte Rusty Harold, Mike J. Brown, Andrea | Costello, Dan Oscarson, Elliotte Rusty Harold, Mike J. Brown, Roy | |||
| Vine, Roy Badami, Jonathan Rosenne, Asmus Freytag, Simon Josefsson, | Badami, Jonathan Rosenne, Asmus Freytag, Simon Josefsson, Carlos | |||
| Carlos Viegas Damasio, Chris Haynes, Walter Underwood, and many | Viegas Damasio, Chris Haynes, Walter Underwood, and many others for | |||
| others for help with understanding the issues and possible solutions, | help with understanding the issues and possible solutions, and | |||
| and getting the details right. Thanks also to the members of the W3C | getting the details right. | |||
| I18N Working Group and Interest Group for their contributions and | ||||
| their work on [CharMod], to the members of many other W3C WGs for | ||||
| adopting IRIs, and to the members of the Montreal IAB Workshop on | ||||
| Internationalization and Localization for their review. | ||||
| 10. References | This document is a product of the Internationalization Working Group | |||
| (I18N WG) of the World Wide Web Consortium (W3C). Thanks to the | ||||
| members of the W3C I18N Working Group and Interest Group for their | ||||
| contributions and their work on [CharMod]. Thanks also go to the | ||||
| members of many other W3C Working Groups for adopting IRIs, and to | ||||
| the members of the Montreal IAB Workshop on Internationalization and | ||||
| Localization for their review. | ||||
| 10.1 Normative References | 11. References | |||
| 11.1 Normative References | ||||
| [ASCII] American National Standards Institute, "Coded Character | ||||
| Set -- 7-bit American Standard Code for Information | ||||
| Interchange", ANSI X3.4, 1986. | ||||
| [ISO10646] | [ISO10646] | |||
| International Organization for Standardization, | International Organization for Standardization, "ISO/IEC | |||
| "Information Technology - Universal Multiple-Octet Coded | 10646:2003: Information Technology - Universal | |||
| Character Set (UCS) - Part 1: Architecture and Basic | Multiple-Octet Coded Character Set (UCS)", ISO Standard | |||
| Multilingual Plane - Part 2: Supplementary Planes", ISO | 10646, December 2003. | |||
| Standard 10646, with amendment, July 2002. | ||||
| [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate | ||||
| Requirement Levels", BCP 14, RFC 2119, March 1997. | ||||
| [RFC2234] Crocker, D. and P. Overell, "Augmented BNF for Syntax | [RFC2234] Crocker, D. and P. Overell, "Augmented BNF for Syntax | |||
| Specifications: ABNF", RFC 2234, November 1997. | Specifications: ABNF", RFC 2234, November 1997. | |||
| [RFC3490] Faltstrom, P., Hoffman, P. and A. Costello, | [RFC3490] Faltstrom, P., Hoffman, P. and A. Costello, | |||
| "Internationalizing Domain Names in Applications (IDNA)", | "Internationalizing Domain Names in Applications (IDNA)", | |||
| RFC 3490, March 2003, <http://www.ietf.org/rfc/ | RFC 3490, March 2003. | |||
| rfc3490.txt>. | ||||
| [RFC3491] Hoffman, P. and M. Blanchet, "Nameprep: A Stringprep | [RFC3491] Hoffman, P. and M. Blanchet, "Nameprep: A Stringprep | |||
| Profile for Internationalized Domain Names (IDN)", RFC | Profile for Internationalized Domain Names (IDN)", RFC | |||
| 3491, March 2003. | 3491, March 2003. | |||
| [RFC3629] Yergeau, F., "UTF-8, a transformation format of ISO | [RFC3629] Yergeau, F., "UTF-8, a transformation format of ISO | |||
| 10646", STD 63, RFC 3629, November 2003, <http:// | 10646", STD 63, RFC 3629, November 2003. | |||
| www.ietf.org/rfc/rfc3629.txt>. | ||||
| [RFCYYYY] Berners-Lee, T., Fielding, R. and L. Masinter, "Uniform | [RFCYYYY] Berners-Lee, T., Fielding, R. and L. Masinter, "Uniform | |||
| Resource Identifier (URI): Generic Syntax", | Resource Identifier (URI): Generic Syntax (Note to the RFC | |||
| draft-fielding-uri-rfc2396bis-03.txt (work in progress), | Editor: Please update this reference with the RFC | |||
| June 2003. | resulting from draft-fielding-uri-rfc2396bis-xx.txt, and | |||
| remove this Note)", draft-fielding-uri-rfc2396bis-05.txt | ||||
| (work in progress), April 2004. | ||||
| [UNI9] Davis, M., "The Bidirectional Algorithm", Unicode Standard | ||||
| Annex #9, March 2004, | ||||
| <http://www.unicode.org/reports/tr9/tr9-13.html>. | ||||
| [UNIV4] The Unicode Consortium, "The Unicode Standard, Version | ||||
| 4.0.1, defined by: The Unicode Standard, Version 4.0 | ||||
| (Reading, MA, Addison-Wesley, 2003. ISBN 0-321-18578-1), | ||||
| as amended by Unicode 4.0.1 | ||||
| (http://www.unicode.org/versions/Unicode4.0.1/)", March | ||||
| 2004. | ||||
| [UTR15] Davis, M. and M. Duerst, "Unicode Normalization Forms", | [UTR15] Davis, M. and M. Duerst, "Unicode Normalization Forms", | |||
| Unicode Standard Annex #15, March 2001, <http:// | Unicode Standard Annex #15, April 2003, | |||
| www.unicode.org/unicode/reports/tr15/tr15-21.html>. | <http://www.unicode.org/unicode/reports/tr15/tr15-23.html>. | |||
| 10.2 Non-normative References | 11.2 Non-normative References | |||
| [BidiEx] "Examples of bidirectional IRIs", <http://www.w3.org/ | [BidiEx] "Examples of bidirectional IRIs", | |||
| International/iri-edit/BidiExamples>. | <http://www.w3.org/International/iri-edit/BidiExamples>. | |||
| [CharMod] Duerst, M., Yergeau, F., Ishida, R., Wolf, M. and T. | [CharMod] Duerst, M., Yergeau, F., Ishida, R., Wolf, M. and T. | |||
| Texin, "Character Model for the World Wide Web", World | Texin, "Character Model for the World Wide Web", World | |||
| Wide Web Consortium Working Draft, August 2003, <http:// | Wide Web Consortium Working Draft, February 2004, <http:// | |||
| www.w3.org/TR/charmod>. | www.w3.org/TR/charmod>. | |||
| [Duerst01] | [Duerst01] | |||
| Duerst, M., "Internationalized Resource Identifiers: From | Duerst, M., "Internationalized Resource Identifiers: From | |||
| Specification to Testing", Proc. 19th International | Specification to Testing", Proc. 19th International | |||
| Unicode Conference, San Jose , September 2001, <http:// | Unicode Conference, San Jose , September 2001, | |||
| www.w3.org/2001/Talks/0912-IUC-IRI/paper.html>. | <http://www.w3.org/2001/Talks/0912-IUC-IRI/paper.html>. | |||
| [Duerst97] | [Duerst97] | |||
| Duerst, M., "The Properties and Promises of UTF-8", Proc. | Duerst, M., "The Properties and Promises of UTF-8", Proc. | |||
| 11th International Unicode Conference, San Jose , | 11th International Unicode Conference, San Jose , | |||
| September 1997, <http://www.ifi.unizh.ch/mml/mduerst/ | September 1997, | |||
| papers/PDF/IUC11-UTF-8.pdf>. | <http://www.ifi.unizh.ch/mml/mduerst/papers/PDF/IUC11-UTF-8.pdf> | |||
| . | ||||
| [Gettys] Gettys, J., "URI Model Consequences", <http://www.w3.org/ | [Gettys] Gettys, J., "URI Model Consequences", | |||
| DesignIssues/ModelConsequences>. | <http://www.w3.org/DesignIssues/ModelConsequences>. | |||
| [HTML4] Raggett, D., Le Hors, A. and I. Jacobs, "HTML 4.01 | [HTML4] Raggett, D., Le Hors, A. and I. Jacobs, "HTML 4.01 | |||
| Specification", World Wide Web Consortium Recommendation, | Specification", World Wide Web Consortium Recommendation, | |||
| December 1999, <http://www.w3.org/TR/REC-html40/appendix/ | December 1999, | |||
| notes.html#h-B.2>. | <http://www.w3.org/TR/REC-html40/appendix/notes.html#h-B.2> | |||
| . | ||||
| [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate | ||||
| Requirement Levels", BCP 14, RFC 2119, March 1997. | ||||
| [RFC2130] Weider, C., Preston, C., Simonsen, K., Alvestrand, H., | [RFC2130] Weider, C., Preston, C., Simonsen, K., Alvestrand, H., | |||
| Atkinson, R., Crispin, M. and P. Svanberg, "The Report of | Atkinson, R., Crispin, M. and P. Svanberg, "The Report of | |||
| the IAB Character Set Workshop held 29 February - 1 March, | the IAB Character Set Workshop held 29 February - 1 March, | |||
| 1996", RFC 2130, April 1997. | 1996", RFC 2130, April 1997. | |||
| [RFC2141] Moats, R., "URN Syntax", RFC 2141, May 1997. | [RFC2141] Moats, R., "URN Syntax", RFC 2141, May 1997. | |||
| [RFC2192] Newman, C., "IMAP URL Scheme", RFC 2192, September 1997. | [RFC2192] Newman, C., "IMAP URL Scheme", RFC 2192, September 1997. | |||
| skipping to change at page 35, line 41 | skipping to change at page 37, line 16 | |||
| [RFC2616] Fielding, R., Gettys, J., Mogul, J., Nielsen, H., | [RFC2616] Fielding, R., Gettys, J., Mogul, J., Nielsen, H., | |||
| Masinter, L., Leach, P. and T. Berners-Lee, "Hypertext | Masinter, L., Leach, P. and T. Berners-Lee, "Hypertext | |||
| Transfer Protocol -- HTTP/1.1", RFC 2616, June 1999. | Transfer Protocol -- HTTP/1.1", RFC 2616, June 1999. | |||
| [RFC2640] Curtin, B., "Internationalization of the File Transfer | [RFC2640] Curtin, B., "Internationalization of the File Transfer | |||
| Protocol", RFC 2640, July 1999. | Protocol", RFC 2640, July 1999. | |||
| [RFC2718] Masinter, L., Alvestrand, H., Zigmond, D. and R. Petke, | [RFC2718] Masinter, L., Alvestrand, H., Zigmond, D. and R. Petke, | |||
| "Guidelines for new URL Schemes", RFC 2718, November 1999. | "Guidelines for new URL Schemes", RFC 2718, November 1999. | |||
| [UNI9] Davis, M., "The Bidirectional Algorithm", Unicode Standard | ||||
| Annex #9, March 2002, <http://www.unicode.org/unicode/ | ||||
| reports/tr9>. | ||||
| [UNIV4] The Unicode Consortium, "The Unicode Standard, Version | ||||
| 4.0", Addison-Wesley, Reading, MA , 2003. | ||||
| [UNIXML] Duerst, M. and A. Freytag, "Unicode in XML and other | [UNIXML] Duerst, M. and A. Freytag, "Unicode in XML and other | |||
| Markup Languages", Unicode Technical Report #20, World | Markup Languages", Unicode Technical Report #20, World | |||
| Wide Web Consortium Note, February 2002, <http:// | Wide Web Consortium Note, February 2002, | |||
| www.w3.org/TR/unicode-xml/>. | <http://www.w3.org/TR/unicode-xml/>. | |||
| [W3CIRI] Duerst, M., "Internationalization - URIs and other | [W3CIRI] Duerst, M., "Internationalization - URIs and other | |||
| identifiers", World Wide Web Consortium Note, September | identifiers", September 2002, | |||
| 2002, <http://www.w3.org/International/ | <http://www.w3.org/International/O-URL-and-ident.html>. | |||
| O-URL-and-ident.html>. | ||||
| [XLink] DeRose, S., Maler, E. and D. Orchard, "XML Linking | [XLink] DeRose, S., Maler, E. and D. Orchard, "XML Linking | |||
| Language (XLink) Version 1.0", World Wide Web Consortium | Language (XLink) Version 1.0", World Wide Web Consortium | |||
| Recommendation, June 2001, <http://www.w3.org/TR/xlink/ | Recommendation, June 2001, | |||
| #link-locators>. | <http://www.w3.org/TR/xlink/#link-locators>. | |||
| [XML1] Bray, T., Paoli, J., Sperberg-McQueen, C. and E. Maler, | [XML1] Bray, T., Paoli, J., Sperberg-McQueen, C., Maler, E. and | |||
| "Extensible Markup Language (XML) 1.0 (Second Edition)", | F. Yergeau, "Extensible Markup Language (XML) 1.0 (Third | |||
| World Wide Web Consortium Recommendation, including | Edition)", World Wide Web Consortium Recommendation, | |||
| Erratum 26 at http://www.w3.org/XML/xml-V10-2e-errata#E26, | February 2004, | |||
| October 2000, <http://www.w3.org/TR/ | <http://www.w3.org/TR/REC-xml#sec-external-ent>. | |||
| REC-xml#sec-external-ent>. | ||||
| [XMLNamespace] | [XMLNamespace] | |||
| Bray, T., Hollander, D. and A. Layman, "Namespaces in | Bray, T., Hollander, D. and A. Layman, "Namespaces in | |||
| XML", World Wide Web Consortium Recommendation, January | XML", World Wide Web Consortium Recommendation, January | |||
| 1999, <http://www.w3.org/TR/REC-xml#sec-external-ent>. | 1999, <http://www.w3.org/TR/REC-xml-names>. | |||
| [XMLSchema] | [XMLSchema] | |||
| Biron, P. and A. Malhotra, "XML Schema Part 2: Datatypes", | Biron, P. and A. Malhotra, "XML Schema Part 2: Datatypes", | |||
| World Wide Web Consortium Recommendation, May 2001, | World Wide Web Consortium Recommendation, May 2001, | |||
| <http://www.w3.org/TR/xmlschema-2/#anyURI>. | <http://www.w3.org/TR/xmlschema-2/#anyURI>. | |||
| [XPointer] | [XPointer] | |||
| Grosso, P., Maler, E., Marsh, J. and N. Walsh, "XPointer | Grosso, P., Maler, E., Marsh, J. and N. Walsh, "XPointer | |||
| Framework", World Wide Web Consortium Recommendation, | Framework", World Wide Web Consortium Recommendation, | |||
| March 2003, <http://www.w3.org/TR/xptr-framework/ | March 2003, | |||
| #escaping>. | <http://www.w3.org/TR/xptr-framework/#escaping>. | |||
| Authors' Addresses | Authors' Addresses | |||
| Martin Duerst (Note: Please write "Duerst" with u-umlaut wherever | Martin Duerst (Note: Please write "Duerst" with u-umlaut wherever | |||
| possible, for example as "D&#252;rst" in XML and HTML.) | possible, for example as "D&#252;rst" in XML and HTML.) | |||
| World Wide Web Consortium | World Wide Web Consortium | |||
| 5322 Endo | 5322 Endo | |||
| Fujisawa, Kanagawa 252-8520 | Fujisawa, Kanagawa 252-8520 | |||
| Japan | Japan | |||
| skipping to change at page 37, line 30 | skipping to change at page 38, line 46 | |||
| Appendix A.1 New Scheme(s) | Appendix A.1 New Scheme(s) | |||
| Introducing new schemes (for example httpi:, ftpi:,...) or a new | Introducing new schemes (for example httpi:, ftpi:,...) or a new | |||
| metascheme (e.g. i:, leading to URI/IRI prefixes such as i:http:, | metascheme (e.g. i:, leading to URI/IRI prefixes such as i:http:, | |||
| i:ftp:,...) was proposed to make IRI-to-URI conversion | i:ftp:,...) was proposed to make IRI-to-URI conversion | |||
| scheme-dependent or to distinguish between percent-encodings | scheme-dependent or to distinguish between percent-encodings | |||
| resulting from IRI-to-URI conversion and percent-encodings from | resulting from IRI-to-URI conversion and percent-encodings from | |||
| legacy character encodings. | legacy character encodings. | |||
| New schemes are not needed to distinguish URIs from true IRIs (i.e. | New schemes are not needed to distinguish URIs from true IRIs (i.e. | |||
| IRIs that contain non-ASCII characters). The benefit of being able to | IRIs that contain non-ASCII characters). The benefit of being able | |||
| detect the origin of percent-encodings is marginal, also because | to detect the origin of percent-encodings is marginal, because UTF-8 | |||
| UTF-8 can be detected with very high reliably. Deploying new schemes | can be detected with very high reliability. Deploying new schemes is | |||
| is extremely hard. Not needing new schemes for IRIs makes deployment | extremely hard, so not requiring new schemes for IRIs makes | |||
| of IRIs vastly easier. Making conversion scheme-dependent is highly | deployment of IRIs vastly easier. Making conversion scheme-dependent | |||
| unadvisable. Using an uniform convention for conversion from IRIs to | is highly inadvisable, and would be encouraged by separate schemes | |||
| URIs makes IRI implementation orthogonal from the introduction of | for IRIs. Using an uniform convention for conversion from IRIs to | |||
| acual new schemes. | URIs makes IRI implementation orthogonal to the introduction of | |||
| actual new schemes. | ||||
| Appendix A.2 Other Character Encodings than UTF-8 | Appendix A.2 Other Character Encodings than UTF-8 | |||
| At an early stage, UTF-7 was considered as an alternative to UTF-8 | At an early stage, UTF-7 was considered as an alternative to UTF-8 | |||
| when converting IRIs to URIs. UTF-7 would not have needed | when converting IRIs to URIs. UTF-7 would not have needed | |||
| percent-encoding, and would in most cases have been shorter than | percent-encoding, and would in most cases have been shorter than | |||
| percent-encoded UTF-8. | percent-encoded UTF-8. | |||
| UTF-8 avoids a double layering and overloading of the use of the "+" | Using UTF-8 avoids a double layering and overloading of the use of | |||
| character. UTF-8 is fully compatible with US-ASCII, and has therefore | the "+" character. UTF-8 is fully compatible with US-ASCII, and has | |||
| been recommended by the IETF, and is being used widely, while UTF-7 | therefore been recommended by the IETF, and is being used widely, | |||
| has never been used much and is now clearly being discouraged. | while UTF-7 has never been used much and is now clearly being | |||
| discouraged. Requiring implementations to convert from UTF-8 to | ||||
| UTF-7 and back would be an additional implementation burden. | ||||
| Appendix A.3 New Encoding Convention | Appendix A.3 New Encoding Convention | |||
| Instead of using the existing percent-encoding convention of URIs, | Instead of using the existing percent-encoding convention of URIs, | |||
| which is based on octets, the idea was to create a new encoding | which is based on octets, the idea was to create a new encoding | |||
| convention, for example to use '%u' to introduce UCS code points. | convention, for example to use '%u' to introduce UCS code points. | |||
| Using the existing octet-based percent-encoding mechanism does not | Using the existing octet-based percent-encoding mechanism does not | |||
| need an upgrade of the URI syntax, and does not need corresponding | need an upgrade of the URI syntax, and does not need corresponding | |||
| server upgrades. | server upgrades. | |||
| Appendix A.4 Indicating Character Encodings in the URI/IRI | Appendix A.4 Indicating Character Encodings in the URI/IRI | |||
| Some proposals suggested indicating the character encodings used in | Some proposals suggested indicating the character encodings used in | |||
| an URI or IRI with some new syntactic convention in the URI itself, | an URI or IRI with some new syntactic convention in the URI itself, | |||
| similar to the 'charset' parameter for emails and Web pages. As an | similar to the 'charset' parameter for emails and Web pages. As an | |||
| example, the label in square brackets in http://www.example.org/ | example, the label in square brackets in | |||
| ros[iso-8859-1]&#xE9; indicated that the following &#xE9; had to be | http://www.example.org/ros[iso-8859-1]&#xE9; indicated that the | |||
| interpreted as iso-8859-1. | following &#xE9; had to be interpreted as iso-8859-1. | |||
| Using UTF-8 only does not need an upgrade to the URI syntax. It | Using UTF-8 only does not need an upgrade to the URI syntax. It | |||
| avoids potentially multiple labels that have to be copied correctly | avoids potentially multiple labels that have to be copied correctly | |||
| in all cases, even on the side of a bus or on a napkin, leading to | in all cases, even on the side of a bus or on a napkin, leading to | |||
| usability problems to the extent of being prohibitively annoying. | usability problems to the extent of being prohibitively annoying. | |||
| Using UTF-8 only also reduces transcoding errors and confusions. | Using UTF-8 only also reduces transcoding errors and confusions. | |||
| Intellectual Property Statement | Intellectual Property Statement | |||
| The IETF takes no position regarding the validity or scope of any | The IETF takes no position regarding the validity or scope of any | |||
| Intellectual Property Rights or other rights that might be claimed to | Intellectual Property Rights or other rights that might be claimed to | |||
| pertain to the implementation or use of the technology described in | pertain to the implementation or use of the technology described in | |||
| this document or the extent to which any license under such rights | this document or the extent to which any license under such rights | |||
| might or might not be available; nor does it represent that it has | might or might not be available; nor does it represent that it has | |||
| made any independent effort to identify any such rights. Information | made any independent effort to identify any such rights. Information | |||
| on the IETF's procedures with respect to rights in IETF Documents can | on the procedures with respect to rights in RFC documents can be | |||
| be found in BCP 78 and BCP 79. | found in BCP 78 and BCP 79. | |||
| Copies of IPR disclosures made to the IETF Secretariat and any | Copies of IPR disclosures made to the IETF Secretariat and any | |||
| assurances of licenses to be made available, or the result of an | assurances of licenses to be made available, or the result of an | |||
| attempt made to obtain a general license or permission for the use of | attempt made to obtain a general license or permission for the use of | |||
| such proprietary rights by implementers or users of this | such proprietary rights by implementers or users of this | |||
| specification can be obtained from the IETF on-line IPR repository at | specification can be obtained from the IETF on-line IPR repository at | |||
| http://www.ietf.org/ipr. | http://www.ietf.org/ipr. | |||
| The IETF invites any interested party to bring to its attention any | The IETF invites any interested party to bring to its attention any | |||
| copyrights, patents or patent applications, or other proprietary | copyrights, patents or patent applications, or other proprietary | |||
| End of changes. | ||||
This html diff was produced by rfcdiff 1.12, available from http://www.levkowetz.com/ietf/tools/rfcdiff/
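
As a rough illustration of the design alternatives in Appendix A.3 and A.4, the following sketch contrasts the existing octet-based percent-encoding with a '%u' code point convention. It assumes Python's standard `urllib.parse` module. The helper names `percent_encode_utf8` and `percent_u_encode` are hypothetical and chosen only for this example, and the exact shape of the '%u' escape (four or more uppercase hex digits) is an assumption, since the draft does not define it.

```python
from urllib.parse import quote, unquote


def percent_encode_utf8(text: str) -> str:
    """Octet-based convention: encode as UTF-8, then escape each octet."""
    return quote(text, safe="", encoding="utf-8")


def percent_u_encode(text: str) -> str:
    """Hypothetical '%u' convention: one escape per UCS code point.

    The escape format used here is an assumption made for illustration.
    """
    return "".join(
        ch if ch.isascii() and ch.isalnum() else "%u{:04X}".format(ord(ch))
        for ch in text
    )


segment = "ros\u00e9"  # the path segment "rosé" from the Appendix A.4 example

# Existing convention with UTF-8: two octets, two escapes.
print(percent_encode_utf8(segment))                      # ros%C3%A9

# Same character in iso-8859-1: a different octet, which is why the
# proposals in Appendix A.4 needed a label to say which encoding was meant.
print(quote(segment, safe="", encoding="iso-8859-1"))    # ros%E9

# Hypothetical '%u' form: one escape for the code point U+00E9.
print(percent_u_encode(segment))                         # ros%u00E9

# An existing octet-based decoder recovers the UTF-8 form unchanged.
print(unquote(percent_encode_utf8(segment)) == segment)  # True
```

The UTF-8 form uses only the '%' followed by two hex digits that URI software already handles, which matches the rationale that no upgrade of the URI syntax or of servers is needed. The '%u' form would need new parsing rules in every consumer.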