| draft-duerst-iri-00.txt | draft-duerst-iri-01.txt | |||
|---|---|---|---|---|
| Network Working Group M. Duerst | Network Working Group M. Duerst | |||
| Internet-Draft W3C/Keio University | Internet-Draft W3C/Keio University | |||
| Expires: October 16, 2002 M. Suignard | Expires: December 30, 2002 M. Suignard | |||
| Microsoft Corporation | Microsoft Corporation | |||
| April 17, 2002 | July 1, 2002 | |||
| Internationalized Resource Identifiers (IRI) | Internationalized Resource Identifiers (IRI) | |||
| draft-duerst-iri-00 | draft-duerst-iri-01 | |||
| Status of this Memo | Status of this Memo | |||
| This document is an Internet-Draft and is in full conformance with | This document is an Internet-Draft and is in full conformance with | |||
| all provisions of Section 10 of RFC2026. | all provisions of Section 10 of RFC2026. | |||
| Internet-Drafts are working documents of the Internet Engineering | Internet-Drafts are working documents of the Internet Engineering | |||
| Task Force (IETF), its areas, and its working groups. Note that | Task Force (IETF), its areas, and its working groups. Note that | |||
| other groups may also distribute working documents as Internet- | other groups may also distribute working documents as Internet- | |||
| Drafts. | Drafts. | |||
| skipping to change at page 1, line 33 | skipping to change at page 1, line 33 | |||
| and may be updated, replaced, or obsoleted by other documents at any | and may be updated, replaced, or obsoleted by other documents at any | |||
| time. It is inappropriate to use Internet-Drafts as reference | time. It is inappropriate to use Internet-Drafts as reference | |||
| material or to cite them other than as "work in progress." | material or to cite them other than as "work in progress." | |||
| The list of current Internet-Drafts can be accessed at http:// | The list of current Internet-Drafts can be accessed at http:// | |||
| www.ietf.org/ietf/1id-abstracts.txt. | www.ietf.org/ietf/1id-abstracts.txt. | |||
| The list of Internet-Draft Shadow Directories can be accessed at | The list of Internet-Draft Shadow Directories can be accessed at | |||
| http://www.ietf.org/shadow.html. | http://www.ietf.org/shadow.html. | |||
| This Internet-Draft will expire on October 16, 2002. | This Internet-Draft will expire on December 30, 2002. | |||
| Copyright Notice | Copyright Notice | |||
| Copyright (C) The Internet Society (2002). All Rights Reserved. | Copyright (C) The Internet Society (2002). All Rights Reserved. | |||
| Abstract | Abstract | |||
| This document defines a new protocol element, the Internationalized | This document defines a new protocol element, the Internationalized | |||
| Resource Identifier (IRI), as a complement to the URI [RFC2396]. An | Resource Identifier (IRI), as a complement to the URI [RFC2396]. An | |||
| IRI is a sequence of characters from the Universal Character Set | IRI is a sequence of characters from the Universal Character Set | |||
| skipping to change at page 2, line 9 | skipping to change at page 2, line 9 | |||
| resources. | resources. | |||
| The approach of defining a new protocol element was chosen, instead | The approach of defining a new protocol element was chosen, instead | |||
| of extending or changing the definition of URIs, to allow a clear | of extending or changing the definition of URIs, to allow a clear | |||
| distinction and to avoid incompatibilities with existing software. | distinction and to avoid incompatibilities with existing software. | |||
| Guidelines for the use and deployment of IRIs in various protocols, | Guidelines for the use and deployment of IRIs in various protocols, | |||
| formats, and software components that now deal with URIs are | formats, and software components that now deal with URIs are | |||
| provided. | provided. | |||
| Section 1 introduces concepts, definitions, and the scope of this | ||||
| specification. Section 2 discusses the IRI syntax and conversion | ||||
| between IRIs and URIs. Section 3 deals with limitations on | ||||
| characters appropriate for use in IRIs, and with processing of IRIs. | ||||
| Section 4 discusses software requirements for IRIs from an | ||||
| operational viewpoint. | ||||
| NOTE | NOTE | |||
| This draft replaces draft-masinter-url-i18n-08.txt. This document is | This document is a product of the Internationalization Working Group | |||
| a product of the Internationalization Working Group (I18N WG) of the | (I18N WG) of the World Wide Web Consortium (W3C). For general | |||
| World Wide Web Consortium (W3C). For general discussion, please use | discussion, please use the www-i18n-comments@w3.org mailing list | |||
| the www-i18n-comments@w3.org mailing list (publicly archived at | (publicly archived at http://lists.w3.org/Archives/Public/www-i18n- | |||
| http://lists.w3.org/Archives/Public/www-i18n-comments/). For more | comments/). For more information on the topic of this document, | |||
| information on the topic of this document, please also see [W3CIRI] | please also see [W3CIRI] and [Duer01]. | |||
| and [Duer01]. | ||||
| Table of Contents | Table of Contents | |||
| 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 4 | 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 4 | |||
| 1.1 Overview and Motivation . . . . . . . . . . . . . . . . . . 4 | 1.1 Overview and Motivation . . . . . . . . . . . . . . . . . . . 4 | |||
| 1.2 Applicability . . . . . . . . . . . . . . . . . . . . . . . 4 | 1.2 Applicability . . . . . . . . . . . . . . . . . . . . . . . . 4 | |||
| 1.3 Definitions . . . . . . . . . . . . . . . . . . . . . . . . 5 | 1.3 Definitions . . . . . . . . . . . . . . . . . . . . . . . . . 5 | |||
| 2. IRI Syntax . . . . . . . . . . . . . . . . . . . . . . . . . 6 | 2. IRI Syntax . . . . . . . . . . . . . . . . . . . . . . . . . . 6 | |||
| 2.1 Summary of IRI syntax . . . . . . . . . . . . . . . . . . . 6 | 2.1 Summary of IRI Syntax . . . . . . . . . . . . . . . . . . . . 6 | |||
| 2.2 ABNF for IRI References and IRIs . . . . . . . . . . . . . . 6 | 2.2 ABNF for IRI References and IRIs . . . . . . . . . . . . . . . 6 | |||
| 2.3 Mapping of IRIs to URIs . . . . . . . . . . . . . . . . . . 8 | 2.3 IRI Equivalence and Normalization . . . . . . . . . . . . . . 9 | |||
| 2.3.1 When to convert from IRIs to URIs . . . . . . . . . . . . . 10 | 3. Relationship between IRIs and URIs . . . . . . . . . . . . . . 10 | |||
| 2.4 Converting URIs to IRIs . . . . . . . . . . . . . . . . . . 10 | 3.1 Mapping of IRIs to URIs . . . . . . . . . . . . . . . . . . . 11 | |||
| 3. Considerations for use of IRIs . . . . . . . . . . . . . . . 11 | 3.2 Converting URIs to IRIs . . . . . . . . . . . . . . . . . . . 12 | |||
| 3.1 IRI Character Limitations . . . . . . . . . . . . . . . . . 11 | 4. Bidirectional IRIs for Right-to-left Languages . . . . . . . . 13 | |||
| 3.2 Bidirectional IRIs for right-to-left languages . . . . . . . 13 | 4.1 Bidi IRI Structure . . . . . . . . . . . . . . . . . . . . . . 14 | |||
| 3.3 Processing IRIs . . . . . . . . . . . . . . . . . . . . . . 13 | 4.2 Visual Rendering of Bidi IRIs . . . . . . . . . . . . . . . . 14 | |||
| 4. Software requirements . . . . . . . . . . . . . . . . . . . 14 | 4.3 Input of Bidi IRIs . . . . . . . . . . . . . . . . . . . . . . 15 | |||
| 4.1 URI/IRI software interfaces . . . . . . . . . . . . . . . . 14 | 5. Use of IRIs . . . . . . . . . . . . . . . . . . . . . . . . . 15 | |||
| 4.2 URI/IRI entry . . . . . . . . . . . . . . . . . . . . . . . 14 | 5.1 Limitations on UCS Character Allowed in IRI . . . . . . . . . 15 | |||
| 4.3 URI/IRI generation . . . . . . . . . . . . . . . . . . . . . 15 | 5.2 Software Interfaces and Protocols . . . . . . . . . . . . . . 16 | |||
| 4.4 URI/IRI selection . . . . . . . . . . . . . . . . . . . . . 16 | 5.3 Format of URIs and IRIs in Documents and Protocols . . . . . . 17 | |||
| 4.5 Display of URIs/IRIs . . . . . . . . . . . . . . . . . . . . 16 | 5.4 Relative IRI References . . . . . . . . . . . . . . . . . . . 17 | |||
| 4.6 Interpretation of URI/IRIs . . . . . . . . . . . . . . . . . 17 | 6. URI/IRI Processing Guidelines (informative) . . . . . . . . . 17 | |||
| 4.7 Transportation of URI/IRIs in document formats and protocols 18 | 6.1 URI/IRI Software Interfaces . . . . . . . . . . . . . . . . . 18 | |||
| 5. Upgrading strategy . . . . . . . . . . . . . . . . . . . . . 18 | 6.2 URI/IRI Entry . . . . . . . . . . . . . . . . . . . . . . . . 18 | |||
| 6. Security considerations . . . . . . . . . . . . . . . . . . 19 | 6.3 URI/IRI Generation . . . . . . . . . . . . . . . . . . . . . . 19 | |||
| 7. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . 20 | 6.4 URI/IRI Selection . . . . . . . . . . . . . . . . . . . . . . 19 | |||
| References . . . . . . . . . . . . . . . . . . . . . . . . . 20 | 6.5 Display of URIs/IRIs . . . . . . . . . . . . . . . . . . . . . 20 | |||
| Authors' Addresses . . . . . . . . . . . . . . . . . . . . . 23 | 6.6 Interpretation of URIs and IRIs . . . . . . . . . . . . . . . 20 | |||
| Full Copyright Statement . . . . . . . . . . . . . . . . . . 24 | 6.7 Upgrading Strategy . . . . . . . . . . . . . . . . . . . . . . 21 | |||
| | 7. Security Considerations . . . . . . . . . . . . . . . . . . . 21 | |||
| | 8. Change log . . . . . . . . . . . . . . . . . . . . . . . . . . 22 | |||
| | 9. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . 23 | |||
| | References . . . . . . . . . . . . . . . . . . . . . . . . . . 23 | |||
| | Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . 26 | |||
| | Full Copyright Statement . . . . . . . . . . . . . . . . . . . 27 | |||
| 1. Introduction | 1. Introduction | |||
| 1.1 Overview and Motivation | 1.1 Overview and Motivation | |||
| A URI is defined in [RFC2396] as a sequence of characters chosen from | A URI is defined in [RFC2396] as a sequence of characters chosen from | |||
| a limited subset of the repertoire of US-ASCII characters. | a limited subset of the repertoire of US-ASCII characters. | |||
| The characters in URIs are frequently used for representing words of | The characters in URIs are frequently used for representing words of | |||
| natural languages. Such usage has many advantages: such URIs are | natural languages. Such usage has many advantages: such URIs are | |||
| skipping to change at page 4, line 40 | skipping to change at page 4, line 40 | |||
| This document defines a new protocol element, called IRI | This document defines a new protocol element, called IRI | |||
| (Internationalized Resource Identifier), by extending the syntax of | (Internationalized Resource Identifier), by extending the syntax of | |||
| URIs to a much wider repertoire of characters. It also defines | URIs to a much wider repertoire of characters. It also defines | |||
| "internationalized" versions corresponding to other constructs from | "internationalized" versions corresponding to other constructs from | |||
| [RFC2396], such as URI references. | [RFC2396], such as URI references. | |||
| Using characters outside of A-Z in IRIs brings with it some | Using characters outside of A-Z in IRIs brings with it some | |||
| difficulties; a discussion of potential problems and workarounds can | difficulties; a discussion of potential problems and workarounds can | |||
| be found in the later sections of this document. | be found in the later sections of this document. | |||
| URIs often contain Internet host names embedded within them. There | ||||
| is an ongoing discussion of internationalization and host names; the | ||||
| specific issues of the relationship of IRIs and possible future | ||||
| "internationalized" host names are not discussed here. (See [IDN- | ||||
| URI] for a separate proposal.) | ||||
| 1.2 Applicability | 1.2 Applicability | |||
| IRIs are designed to be compatible with recent recommendations on URI | IRIs are designed to be compatible with recent recommendations on URI | |||
| syntax [RFC2718]. Practical use of IRIs (or IRI references) in place | syntax [RFC2718]. The compatibility is provided by providing a well | |||
| of URIs (or URI references) depends on the following conditions being | defined and deterministic mapping from the IRI character sequence to | |||
| met: | the functionally equivalent URI character sequence. Practical use of | |||
| | IRIs (or IRI references) in place of URIs (or URI references) depends | |||
| | on the following conditions being met: | |||
| a. The protocol or format element used should be explicitly | a. The protocol or format element used should be explicitly | |||
| designated to carry IRIs. That is, the intent is not to | designated to carry IRIs. That is, the intent is not to | |||
| introduce IRIs into contexts that are not defined to accept | introduce IRIs into contexts that are not defined to accept | |||
| them. For examlpe, XML schema [XMLSchema] has an explicit type | them. For example, XML schema [XMLSchema] has an explicit type | |||
| "anyURI" that can be used to designate IRIs. | "anyURI" that designates the use of IRIs. | |||
| b. The protocol or format carrying the IRIs must have a mechanism | b. The protocol or format carrying the IRIs must have a mechanism | |||
| to represent the wide range of characters used in IRIs, either | to represent the wide range of characters used in IRIs, either | |||
| natively or by some protocol- or format-specific escaping | natively or by some protocol- or format-specific escaping | |||
| mechanism (for example numeric character references in [XML1]). | mechanism (for example numeric character references in [XML1]). | |||
| c. Either by definition for all the URIs of a specific URI | c. Either by definition for all the URIs of a specific URI | |||
| scheme, or at least for some specific URIs, the encoding of | scheme, or at least for some specific URIs, the encoding of | |||
| non-ASCII characters has to be based on UTF-8. For new URI | non-ASCII characters has to be based on UTF-8. For new URI | |||
| schemes, this is recommended in [RFC2718]. This allows IRIs to | schemes, this is recommended in [RFC2718]. This allows IRIs to | |||
| skipping to change at page 5, line 33 | skipping to change at page 5, line 28 | |||
| a piece of a URI (reference), such as the fragment identifier. | a piece of a URI (reference), such as the fragment identifier. | |||
| In cases and for pieces where an encoding other than UTF-8 is used, | In cases and for pieces where an encoding other than UTF-8 is used, | |||
| and for raw binary data encoded in URIs (see [RFC2397]), the octets | and for raw binary data encoded in URIs (see [RFC2397]), the octets | |||
| have to be %-escaped. In these situations, the ability of IRIs to | have to be %-escaped. In these situations, the ability of IRIs to | |||
| directly represent a wide character repertoire cannot be used. | directly represent a wide character repertoire cannot be used. | |||
| 1.3 Definitions | 1.3 Definitions | |||
| The following definitions are used in this document; they follow the | The following definitions are used in this document; they follow the | |||
| terms in [RFC2130] and [RFC2277]: | terms in [RFC2130], [RFC2277] and [ISO10646]: | |||
| character: An abstract object with a separate identity. For | character: A member of a set of elements used for the | |||
| example, "LATIN CAPITAL LETTER A" names a character. | organization, control, or representation of data. For example, | |||
| | "LATIN CAPITAL LETTER A" names a character. | |||
| octet: 8 bits | octet: an ordered sequence of eight bits considered as a unit | |||
| character repertoire: A set of characters (in the mathematical | character repertoire: A set of characters (in the mathematical | |||
| sense) | sense) | |||
| sequence of characters: A sequence (one after another) of | sequence of characters: A sequence (one after another) of | |||
| characters | characters | |||
| sequence of octets: A sequence (one after another) of octets | sequence of octets: A sequence (one after another) of octets | |||
| (character) encoding: A method of representing a sequence of | (character) encoding: A method of representing a sequence of | |||
| skipping to change at page 6, line 28 | skipping to change at page 6, line 24 | |||
| sequence of octets. This definition accommodates the fact that IRIs | sequence of octets. This definition accommodates the fact that IRIs | |||
| may be written on paper or read over the radio as well as being | may be written on paper or read over the radio as well as being | |||
| transmitted over the network. The same IRI may be represented as | transmitted over the network. The same IRI may be represented as | |||
| different sequences of octets in different protocols or documents if | different sequences of octets in different protocols or documents if | |||
| these protocols or documents use different character encodings and/or | these protocols or documents use different character encodings and/or | |||
| transfer encodings. Using the same character encoding as the | transfer encodings. Using the same character encoding as the | |||
| containing protocol or document assures that the characters in the | containing protocol or document assures that the characters in the | |||
| IRI can be handled (searched, converted, displayed,...) in the same | IRI can be handled (searched, converted, displayed,...) in the same | |||
| way as the rest of the protocol or document. | way as the rest of the protocol or document. | |||
| 2.1 Summary of IRI syntax | 2.1 Summary of IRI Syntax | |||
| IRIs are defined similarly to URIs in [RFC2396] (as modified by | IRIs are defined similarly to URIs in [RFC2396] (as modified by | |||
| [RFC2732]), but the class of unreserved characters is extended by | [RFC2732] and [IDNURI]), but the class of unreserved characters is | |||
| adding all the characters of the UCS (Universal Character Set, | extended by adding all the characters of the UCS (Universal Character | |||
| [ISO10646]) beyond U+0080, subject to the limitations given in | Set, [ISO10646]) beyond U+0080, subject to the limitations given in | |||
| Section 3. | Section 5.1. | |||
| Otherwise, the syntax and use of components and reserved characters | Otherwise, the syntax and use of components and reserved characters | |||
| is the same as that in [RFC2396]. All the operations defined in | is the same as that in [RFC2396]. All the operations defined in | |||
| [RFC2396], such as the resolution of relative URIs, can be applied to | [RFC2396], such as the resolution of relative URIs, can be applied to | |||
| IRIs by IRI-processing software in exactly the same way as this is | IRIs by IRI-processing software in exactly the same way as this is | |||
| done to URIs by URI-processing software. | done to URIs by URI-processing software. | |||
| Characters outside the US-ASCII range MUST NOT be used for | Characters outside the US-ASCII range MUST NOT be used for | |||
| syntactical purposes such as to delimit components in newly defined | syntactical purposes such as to delimit components in newly defined | |||
| schemes. As an example, it is not allowed to use U+00A2, CENT SIGN, | schemes. As an example, it is not allowed to use U+00A2, CENT SIGN, | |||
| skipping to change at page 7, line 9 | skipping to change at page 7, line 5 | |||
| is in the 'unreserved' category. | is in the 'unreserved' category. | |||
| 2.2 ABNF for IRI References and IRIs | 2.2 ABNF for IRI References and IRIs | |||
| While it might be possible to define IRI references and IRIs merely | While it might be possible to define IRI references and IRIs merely | |||
| by their transformation to URIs, they can also be accepted and | by their transformation to URIs, they can also be accepted and | |||
| processed directly. Therefore, an ABNF definition for IRI references | processed directly. Therefore, an ABNF definition for IRI references | |||
| (which are the most general concept and the start of the grammar) and | (which are the most general concept and the start of the grammar) and | |||
| IRIs is given here. | IRIs is given here. | |||
| The following rules are different form [RFC2396]: | The following rules are different from [RFC2396]: | |||
| IRI-reference = [ absoluteIRI | relativeIRI ] [ "#" ifragment ] | IRI-reference = [ absoluteIRI | relativeIRI ] [ "#" ifragment ] | |||
| absoluteIRI = scheme ":" ( ihier_part | iopaque_part ) | absoluteIRI = scheme ":" ( ihier_part | iopaque_part ) | |||
| relativeIRI = ( inet_path | iabs_path | irel_path ) | relativeIRI = ( inet_path | iabs_path | irel_path ) | |||
| [ "?" iquery ] | [ "?" iquery ] | |||
| ihier_part = ( inet_path | iabs_path ) [ "?" iquery ] | ihier_part = ( inet_path | iabs_path ) [ "?" iquery ] | |||
| iopaque_part = iric_no_slash *iric | iopaque_part = iric_no_slash *iric | |||
| iric_no_slash = iunreserved | escaped | ";" | "?" | ":" | "@" | | iric_no_slash = iunreserved | escaped | ";" | "?" | ":" | "@" | | |||
| "&" | "=" | "+" | "$" | "," | "&" | "=" | "+" | "$" | "," | |||
| inet_path = "//" iauthority [ iabs_path ] | inet_path = "//" iauthority [ iabs_path ] | |||
| iabs_path = "/" ipath_segments | iabs_path = "/" ipath_segments | |||
| irel_path = irel_segment [ iabs_path ] | irel_path = irel_segment [ iabs_path ] | |||
| irel_segment = 1*( iunreserved | escaped | | irel_segment = 1*( iunreserved | escaped | | |||
| ";" | "@" | "&" | "=" | "+" | "$" | "," ) | ";" | "@" | "&" | "=" | "+" | "$" | "," ) | |||
| iauthority = server | ireg_name | iauthority = iserver | ireg_name | |||
| ireg_name = 1*( iunreserved | escaped | "$" | "," | | ireg_name = 1*( iunreserved | escaped | "$" | "," | | |||
| ";" | ":" | "@" | "&" | "=" | "+" ) | ";" | ":" | "@" | "&" | "=" | "+" ) | |||
| | iserver = [ [ userinfo "@" ] ihostport ] | |||
| | iuserinfo = *( iunreserved | escaped | | |||
| | ";" | ":" | "&" | "=" | "+" | "$" | "," ) | |||
| | ihostport = ihost [ ":" port ] | |||
| | ihost = ihostname | IPv4address | IPv6reference | |||
| | ihostname = << as specified by [IDNA] >> | |||
| ipath_segments = isegment *( "/" isegment ) | ipath_segments = isegment *( "/" isegment ) | |||
| isegment = *ipchar *( ";" iparam ) | isegment = *ipchar *( ";" iparam ) | |||
| iparam = *ipchar | iparam = *ipchar | |||
| ipchar = iunreserved | escaped | | ipchar = iunreserved | escaped | | |||
| ":" | "@" | "&" | "=" | "+" | "$" | "," | ":" | "@" | "&" | "=" | "+" | "$" | "," | |||
| iquery = *iric | iquery = *iric | |||
| ifragment = *iric | ifragment = *iric | |||
| iric = reserved | iunreserved | escaped | iric = reserved | iunreserved | escaped | |||
| iunreserved = ichar | unreserved | iunreserved = ichar | unreserved | |||
| ichar = << character of the UCS [ISO10646] of beyond | ichar = << allowed character of the UCS [ISO10646] >> | space | delims | unwise | |||
| U+009F, subject to the limitations in | ||||
| Section 3.1. >> | space | delims | unwise | ||||
| Note that the space character and various delimiters are allowed in | Note that the space character and various delimiters are allowed in | |||
| IRIs and IRI references. This is further discussed in section 3.1, | IRIs and IRI references. This is further discussed in Section 5.1. | |||
| point b. | ||||
| | The following describe the allowed characters of the UCS [ISO10646] | |||
| | using the UCS-4 encoding notation for these characters: | |||
| | U+00A0-U+D7FF | |||
| | U+F900-U+FDCF | |||
| | U+FDF0-U+FFEF | |||
| | U+10000-U+1FFFD | |||
| | U+20000-U+2FFFD | |||
| | U+30000-U+3FFFD | |||
| | U+40000-U+4FFFD | |||
| | U+50000-U+5FFFD | |||
| | U+60000-U+6FFFD | |||
| | U+70000-U+7FFFD | |||
| | U+80000-U+8FFFD | |||
| | U+90000-U+9FFFD | |||
| | U+A0000-U+AFFFD | |||
| | U+B0000-U+BFFFD | |||
| | U+C0000-U+CFFFD | |||
| | U+D0000-U+DFFFD | |||
| | U+E1000-U+EFFFD | |||
| The following are the same as [RFC2396] as modified by [RFC2732]: | The following are the same as [RFC2396] as modified by [RFC2732]: | |||
| reserved = ";" | "/" | "?" | ":" | "@" | "&" | "=" | "+" | | reserved = ";" | "/" | "?" | ":" | "@" | "&" | "=" | "+" | | |||
| "$" | "," | "[" | "]" | "$" | "," | "[" | "]" | |||
| unreserved = alphanum | mark | unreserved = alphanum | mark | |||
| mark = "-" | "_" | "." | "!" | "~" | "*" | "'" | | mark = "-" | "_" | "." | "!" | "~" | "*" | "'" | | |||
| "(" | ")" | "(" | ")" | |||
| escaped = "%" HEXDIG HEXDIG | escaped = "%" hex hex | |||
| server = [ [ userinfo "@" ] hostport ] | hex = digit | "A" | "B" | "C" | "D" | "E" | "F" | | |||
| userinfo = *( unreserved | escaped | | "a" | "b" | "c" | "d" | "e" | "f" | |||
| ";" | ":" | "&" | "=" | "+" | "$" | "," ) | ||||
| hostport = host [ ":" port ] | ||||
| host = hostname | IPv4address | IPv6reference | ||||
| IPv6reference = "[" IPv6address "]" | IPv6reference = "[" IPv6address "]" | |||
| hostname = *( domainlabel "." ) toplabel [ "." ] | ||||
| domainlabel = alphanum | alphanum *( alphanum | "-" ) alphanum | ||||
| toplabel = alpha | alpha *( alphanum | "-" ) alphanum | ||||
| IPv6address = hexpart [ ":" IPv4address ] | IPv6address = hexpart [ ":" IPv4address ] | |||
| IPv4address = 1*3DIGIT "." 1*3DIGIT "." 1*3DIGIT "." 1*3DIGIT | IPv4address = 1*3DIGIT "." 1*3DIGIT "." 1*3DIGIT "." 1*3DIGIT | |||
| hexpart = hexseq | hexseq "::" [ hexseq ] | "::" | hexpart = hexseq | hexseq "::" [ hexseq ] | "::" | |||
| [ hexseq ] | [ hexseq ] | |||
| hexseq = hex4 *( ":" hex4) | hexseq = hex4 *( ":" hex4) | |||
| hex4 = 1*4HEXDIG | hex4 = 1*4hex | |||
| port = *DIGIT | port = *DIGIT | |||
| scheme = alpha *( alpha | digit | "+" | "-" | "." ) | scheme = alpha *( alpha | digit | "+" | "-" | "." ) | |||
| alphanum = alpha | digit | alphanum = alpha | digit | |||
| alpha = lowalpha | upalpha | alpha = lowalpha | upalpha | |||
| lowalpha = "a" | "b" | "c" | "d" | "e" | "f" | "g" | "h" | "i" | | lowalpha = "a" | "b" | "c" | "d" | "e" | "f" | "g" | "h" | "i" | | |||
| "j" | "k" | "l" | "m" | "n" | "o" | "p" | "q" | "r" | | "j" | "k" | "l" | "m" | "n" | "o" | "p" | "q" | "r" | | |||
| "s" | "t" | "u" | "v" | "w" | "x" | "y" | "z" | "s" | "t" | "u" | "v" | "w" | "x" | "y" | "z" | |||
| upalpha = "A" | "B" | "C" | "D" | "E" | "F" | "G" | "H" | "I" | | upalpha = "A" | "B" | "C" | "D" | "E" | "F" | "G" | "H" | "I" | | |||
| "J" | "K" | "L" | "M" | "N" | "O" | "P" | "Q" | "R" | | "J" | "K" | "L" | "M" | "N" | "O" | "P" | "Q" | "R" | | |||
| "S" | "T" | "U" | "V" | "W" | "X" | "Y" | "Z" | "S" | "T" | "U" | "V" | "W" | "X" | "Y" | "Z" | |||
| space = <US-ASCII coded character 20 hexadecimal> | digit = "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | | |||
| | "8" | "9" | |||
| | space = << US-ASCII coded character 20 hexadecimal >> | |||
| delims = "<" | ">" | "#" | "%" | <"> | delims = "<" | ">" | "#" | "%" | <"> | |||
| unwise = "{" | "}" | "|" | "\" | "^" | "`" | unwise = "{" | "}" | "|" | "\" | "^" | "`" | |||
| 2.3 Mapping of IRIs to URIs | 2.3 IRI Equivalence and Normalization | |||
| | There is no general rule or procedure to decide whether two arbitrary | |||
| | IRIs are equivalent or not (i.e. refer to the same resource or not). | |||
| | Two IRIs that look almost the same may refer to different resources. | |||
| | Two IRIs that look completely different may refer to, and resolve to, | |||
| | the same resource. | |||
| | In some scenarios, such as XML Namespaces ([XMLNamespace]), a | |||
| | definite answer to the question of IRI equivalence is needed that is | |||
| | independent of the scheme used and always can be calculated quickly | |||
| | and without accessing a network. In such cases, two IRIs SHOULD be | |||
| | defined as equivalent if and only if they are character-by-character | |||
| | equivalent (which is the same as byte-by-byte equivalent if the | |||
| | character encoding for both IRIs is the same). In such a case, the | |||
| | comparison function MUST NOT map the IRIs to URIs. | |||
| | It follows from the above that IRIs SHOULD NOT be modified when being | |||
| | transported. | |||
| | For actual resolution, differences in escaping (except for the | |||
| | escaping of reserved characters) MUST always result in the same | |||
| | resource. For example, foo://example.com/XML, foo://example.com/ | |||
| | XM%4C, and foo://example.com/XM%4c must resolve to the same resource. | |||
| | If this kind of equivalence is to be tested, the escaping of both | |||
| | IRIs to be compared has to be aligned, for example by converting both | |||
| | IRIs to URIs (see Section 3.1) and making sure that the case of the | |||
| | hexadecimal characters in the %-escape is always the same. Such | |||
| | conversions MUST only be done on the fly, without changing the | |||
| | original IRI. | |||
| | Specific schemes and resolution mechanisms may define additional | |||
| | equivalences. For a specific scheme, two IRIs that e.g. differ only | |||
| | by case may be equivalent. However, this document does not deal with | |||
| | scheme-specific issues. | |||
| | The Unicode Standard [UNIV3] defines various equivalences between | |||
| | sequences of characters for various purposes. Unicode Standard Annex | |||
| | #15 [UNI15] defines various Normalization Forms for these | |||
| | equivalences. IRIs SHOULD be created using the Normalization Form C | |||
| | (NFC). When an IRI is created in a UCS-based encoding without the | |||
| | end-user being aware of or interested in Unicode normalization | |||
| | issues, the IRI MUST be created using the normalization form NFC. | |||
| | Equivalence of IRIs MUST rely on the IRIs being appropriately pre- | |||
| | normalized, rather than applying normalization, except when | |||
| | converting from a non-UCS-based encoding to a UCS-based encoding, | |||
| | where a normalizing transcoder using NFC MUST be used. | |||
| | Various IRI schemes may allow the usage of International Domain Names | |||
| | (IDN) [IDNA]. When in use in IRIs, those names SHOULD be validated | |||
| | using the rules defined by [Nameprep]. An IRI containing an invalid | |||
| | IDN cannot successfully be resolved. For legibility purposes, IDN | |||
| | components of IRIs SHOULD NOT be converted into ASCII Compatible | |||
| | Encoding (ACE). However, this conversion may be applied when mapping | |||
| | an IRI into a URI, see Section 3.1. | |||
| | 3. Relationship between IRIs and URIs | |||
| | IRIs are meant to replace URIs in identifying resources for | |||
| | protocols, formats and software components which use a UCS-based | |||
| | character repertoire. These protocols and components may never need | |||
| | to use URIs directly, especially when the resource identifier is used | |||
| | simply for identification purposes. However, when the resource | |||
| | identifier is used for resource retrieval, it is in many cases | |||
| | necessary to determine the associated URI because most retrieval | |||
| | mechanisms currently only are defined for URIs. (Additional | |||
| | rationale is given in Section 3.1.) | |||
| | 3.1 Mapping of IRIs to URIs | |||
| This section defines how to map an IRI to a URI. Everything in this | This section defines how to map an IRI to a URI. Everything in this | |||
| section applies also to IRI references and URI references, as well as | section applies also to IRI references and URI references, as well as | |||
| components thereof (for example fragment identifiers). | components thereof (for example fragment identifiers). | |||
| This mapping has two purposes: | This mapping has two purposes: | |||
| a) Syntactical: Many URI schemes and components define additional | a) Syntactical: Many URI schemes and components define additional | |||
| syntactical restrictions not captured in Section 2.2. Such | syntactical restrictions not captured in Section 2.2. Such | |||
| restrictions can be applied to IRIs by noting that IRIs are | restrictions can be applied to IRIs by noting that IRIs are | |||
| only valid if they map to syntactically valid URIs. This means | only valid if they map to syntactically valid URIs. This means | |||
| that such syntactical restrictions do not have to be defined | that such syntactical restrictions do not have to be defined | |||
| again on the IRI level. | again on the IRI level. | |||
| b) Interpretational: URIs identify resources in various ways. | b) Interpretational: URIs identify resources in various ways. | |||
| IRIs also identify resources. The resource that an IRI | IRIs also identify resources. When the IRI is used simply for | |||
| identifies is the same as the one identified by the URI | identification purposes, it is not necessary to map the IRI to | |||
| obtained after converting the IRI according to the procedure | a URI (see Section 2.3). However, when an IRI is used for | |||
| defined here. This means that there is no need to define the | resource retrieval, the resource that the IRI locates is the | |||
| association between identifier and resource again on the IRI | same as the one located by the URI obtained after converting | |||
| the IRI according to the procedure defined here. This means | ||||
| that there is no need to define resolution again on the IRI | ||||
| level. | level. | |||
| This mapping is accomplished in two parts. Part A) is skipped if the | This mapping is accomplished in two steps. | |||
| input is already in a UCS-based encoding (for example UTF-8 or UTF- | ||||
| 16). In that case, it is assumed that the IRI is already in NFC. | ||||
| Part A) This part has three variants, depending on where the input | ||||
| comes from. | ||||
| Variant 1) a) Start with an IRI written on paper or read out | ||||
| loud, or otherwise represented as a sequence of | ||||
| characters independent of any encoding. b) Represent the | ||||
| IRI characters as a sequence of characters from the UCS. | ||||
| c) Normalize the character sequence according to | ||||
| Normalization Form C (NFC), as defined in [UNI15]. (See | ||||
| further discussion in Section 3.1.) | ||||
| Note: In practice, steps b) and c) will often be | Step 1) This step generates a UCS-based encoding from the original | |||
| performed together, for example by using a keyboard or | IRI format. This step has three variants, depending on the | |||
| other input mechanism that is designed to produce NFC. | form of the input. | |||
| Variant 2) a) Start with an IRI in some digital | Variant A) If the IRI is written on paper or read out loud, | |||
| representation (e.g. an octet stream) in some non- | or otherwise represented as a sequence of characters | |||
| Unicode encoding. b) Represent the IRI characters as a | independent of any encoding: Represent the IRI as a | |||
| sequence of characters from the UCS. c) Normalize the | sequence of characters from the UCS normalized according | |||
| character sequence according to Normalization Form C, as | to Normalization Form C (NFC, [UNI15]). | |||
| defined in [UNI15]. (See further discussion in Section | ||||
| 3.1.) | ||||
| Note: In practice, steps b) and c) will often be | Variant B) If the IRI is in some digital representation | |||
| performed together, for example by using a transcoder | (e.g. an octet stream) in some non-Unicode encoding: | |||
| that produces output in NFC. | Convert the IRI to a sequence of characters from the UCS | |||
| normalized according to NFC. | ||||
| Variant 3) a) Start with an IRI in a Unicode-based encoding | Variant C) If the IRI is in a Unicode-based encoding (for | |||
| (for example UTF-8 or UTF-16). Move directly to Part 2. | example UTF-8 or UTF-16): Do not normalize. Move | |||
| It is assumed that the IRI is already in NFC. | directly to Step 2. | |||
| Part B) For each character that is disallowed in URI references, | Step 2) For each character that is disallowed in URI references, | |||
| apply steps a) through c) below. The disallowed characters | apply steps 1) through 3) below. The disallowed characters | |||
| consist of all non-ASCII characters, plus the excluded | consist of all non-ASCII characters, plus the excluded | |||
| characters listed in Section 2.4 of [RFC2396], except for the | characters listed in Section 2.4 of [RFC2396], except for the | |||
| number sign (#) and percent sign (%) and the square bracket | number sign (#) and percent sign (%) and the square bracket | |||
| characters re-allowed in [RFC2732]. | characters re-allowed in [RFC2732]. | |||
| 1) Convert the character to a sequence of one or more octets | 1) Convert the character to a sequence of one or more octets | |||
| using UTF-8 [RFC2279]. | using UTF-8 [RFC2279]. | |||
| 2) Convert each octet to %HH, where HH is the hexadecimal | 2) Convert each octet to %HH, where HH is the hexadecimal | |||
| notation of the octet value. Note: This is identical to | notation of the octet value. Note: This is identical to | |||
| the escaping mechanism in Section 2.4.1 of [RFC2396]. | the escaping mechanism in Section 2.4.1 of [RFC2396]. | |||
| 3) Replace the original character by the resulting character | 3) Replace the original character by the resulting character | |||
| sequence. | sequence. | |||
| Note that in this process (in step B3), characters allowed in URI | Note that in this process (in step 2.3), characters allowed in URI | |||
| references and existing escape sequences are not escaped further. | references and existing escape sequences are not escaped further. | |||
| (This mapping is similar to, but different from, the escaping applied | (This mapping is similar to, but different from, the escaping applied | |||
| when including arbitrary content into some part of a URI.) | when including arbitrary content into some part of a URI.) | |||
| The above mapping produces a URI fully conforming to [RFC2396] out of | The above mapping produces a URI fully conforming to [RFC2396] (as | |||
| each IRI. The mapping is also an identity transformation for URIs | amended by [RFC2732] and [IDNURI]) out of each IRI. The mapping is | |||
| and is idempotent--applying the mapping a second time will not change | also an identity transformation for URIs and is idempotent -- | |||
| anything. Every URI is therefore by definition an IRI. Section 2.3 | applying the mapping a second time will not change anything. Every | |||
| gives details about when exactly to convert from an IRI to an URI. | URI is therefore by definition an IRI. | |||
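The escaping step of the mapping (Step 2, called Part B in the earlier revision) can be sketched as follows. This is a minimal illustration, not a reference implementation: the names `iri_to_uri`, `is_disallowed` and `EXCLUDED_ASCII` are hypothetical, the set of excluded ASCII characters is transcribed from the "control", "space", "delims" and "unwise" classes of [RFC2396] minus the re-allowed '#', '%', '[' and ']', and the input is assumed to be a Python string (that is, already a sequence of UCS characters, so normalization is off by default).

```python
import unicodedata

# Assumption: "space", "delims" and "unwise" from RFC 2396, Section 2.4.3,
# minus '#', '%', '[' and ']' which are not escaped by this mapping.
EXCLUDED_ASCII = set('<>"{}|\\^` ')


def is_disallowed(ch):
    """True for non-ASCII characters, control characters and excluded ASCII."""
    cp = ord(ch)
    return cp > 0x7E or cp < 0x20 or ch in EXCLUDED_ASCII


def iri_to_uri(iri, normalize=False):
    """Map an IRI (a str of UCS characters) to a URI.

    normalize=True applies NFC first, as for input that did not start out
    in a Unicode-based encoding; it is off by default.
    """
    if normalize:
        iri = unicodedata.normalize("NFC", iri)
    out = []
    for ch in iri:
        if is_disallowed(ch):
            # 1) UTF-8 octets, 2) %HH per octet, 3) replace the character.
            out.append("".join("%{:02X}".format(b) for b in ch.encode("utf-8")))
        else:
            # Allowed characters and existing escapes pass through unchanged.
            out.append(ch)
    return "".join(out)


uri = iri_to_uri("http://example.org/r\u00e9sum\u00e9")
```

Because '%' is never escaped, applying `iri_to_uri` to its own output changes nothing, which matches the idempotence property stated above.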
| 2.3.1 When to convert from IRIs to URIs | ||||
| The mapping from IRIs to URIs SHOULD only be applied when necessary, | Note: For backwards compatibility with infrastructure that does not | |||
| and as late as possible. | implement the updates of [IDNURI], converters MAY also convert the | |||
| 'ihostname' part of an IRI using the ToASCII operation specified in | ||||
| Section 4.1 of [IDNA] between Step 1 and Step 2. Note that the | ||||
| ToASCII operation may fail. Note that Internationalized Domain Names | ||||
| may be contained in parts of an IRI other than the 'ihostname' part. | ||||
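The optional host name conversion in the note above can be approximated with Python's built-in "idna" codec. This is a hedged sketch: the codec implements the ToASCII operation as later published in RFC 3490, which may differ in detail from the [IDNA] draft cited here, and `host_to_ace` is a hypothetical helper name. The sketch returns None when ToASCII fails, leaving the caller to decide how to proceed.

```python
def host_to_ace(ihostname):
    """Convert an internationalized host name to its ACE form.

    Returns None if the ToASCII operation fails (for example, when a
    label is longer than 63 octets).
    """
    try:
        return ihostname.encode("idna").decode("ascii")
    except UnicodeError:
        return None


ace = host_to_ace("b\u00fccher.example")
```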
| 2.4 Converting URIs to IRIs | 3.2 Converting URIs to IRIs | |||
| In some situations, it may be desirable to try to convert a URI into | In some situations, it may be desirable to try to convert a URI into | |||
| an equivalent IRI. This section gives a procedure to do such a | an equivalent IRI. This section gives a procedure to do such a | |||
| conversion. In general, the IRI to URI mapping is many-to-one, so | conversion. The conversion described in this section will always | |||
| the conversion is not invertible. The conversion described in this | give an IRI which maps back to the URI that was used as an input for | |||
| section will always give an IRI which maps back to the URI that was | the conversion, but perhaps not exactly the original IRI (if there | |||
| used as an input for the conversion, but perhaps not exactly the | ever was one). | |||
| original IRI (if there ever was one). In general, URI to IRI | ||||
| conversion removes escape sequences, but not all escaping can be | URI to IRI conversion removes escape sequences, but not all escaping | |||
| eliminated. There are many reasons for this: | can be eliminated. There are many reasons for this: | |||
| a. Some escape sequences are necessary to distinguish escaped and | a. Some escape sequences are necessary to distinguish escaped and | |||
| unescaped uses of reserved characters. | unescaped uses of reserved characters. | |||
| b. Some escape sequences cannot be interpreted as sequences of | b. Some escape sequences cannot be interpreted as sequences of | |||
| UTF-8 octets. | UTF-8 octets. | |||
| (Note: Due to the regularities in the octet patterns of UTF-8, | (Note: Due to the regularities in the octet patterns of UTF-8, | |||
| there is a very high probability, but no guarantee, that escape | there is a very high probability, but no guarantee, that escape | |||
| sequences that can be interpreted as sequences of UTF-8 octets | sequences that can be interpreted as sequences of UTF-8 octets | |||
| actually originated from UTF-8. For a detailed discussion of | actually originated from UTF-8. For a detailed discussion, see | |||
| the odds, see [Duer97].) | [Duer97].) | |||
| c. The conversion may result in a character that is not | c. The conversion may result in a character that is not | |||
| appropriate in an IRI. See section 3.1 for further details. | appropriate in an IRI. See Section 5.1 for further details. | |||
| Conversion from a URI to an IRI is done using the following steps (or | Conversion from a URI to an IRI is done using the following steps (or | |||
| any other algorithm that produces the same result): | any other algorithm that produces the same result): | |||
| 1) Represent the URI as a sequence of octets in US-ASCII. | 1) Represent the URI as a sequence of octets in US-ASCII. | |||
| 2) Convert all hexadecimal escapes (% followed by two hexadecimal | 2) Convert all hexadecimal escapes (% followed by two hexadecimal | |||
| digits) of %80 and higher to the corresponding octets. | digits) of %80 and higher to the corresponding octets. | |||
| 3) Re-escape any octets that are not part of a strictly legal UTF- | 3) Re-escape any octets that are not part of a strictly legal UTF- | |||
| 8 octet sequence. | 8 octet sequence. | |||
| 4) Re-escape all octets that in UTF-8 reperesent characters that | 4) Re-escape all octets that in UTF-8 represent characters that | |||
| are not appropriate according to Section 3.1. | are not appropriate according to Section 5.1. | |||
| 5) Interpret the resulting octet sequence as a sequence of | ||||
| characters encoded in UTF-8. | ||||
| This procedure will convert as many escaped non-ASCII characters as | This procedure will convert as many escaped non-ASCII characters as | |||
| possible to characters in an IRI. Because there are some choices | possible to characters in an IRI. Because there are some choices | |||
| when applying step 3) (see Section 3.1), results may differ. | when applying step 4) (see Section 5.1), results may differ. | |||
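The URI-to-IRI steps above might be sketched as follows. The names `uri_to_iri`, `_decode_one` and the `is_appropriate` callback are hypothetical; the callback stands in for whatever character policy an implementation chooses for the "not appropriate" check, and by default accepts everything. Python's strict UTF-8 decoder is used to decide what counts as a strictly legal sequence. Re-escaped octets are written with upper-case hex digits, so the case of an escape may differ from the input.

```python
import re

# Escapes of %80 and higher only; escapes below %80 are left untouched.
_HIGH_ESCAPE = re.compile(rb"%([89A-Fa-f][0-9A-Fa-f])")


def _decode_one(octets, i):
    """Try to decode one multi-octet UTF-8 character starting at index i."""
    for n in (2, 3, 4):
        try:
            return octets[i:i + n].decode("utf-8"), n
        except UnicodeDecodeError:
            pass
    return None, 1


def uri_to_iri(uri, is_appropriate=lambda ch: True):
    octets = uri.encode("ascii")                       # represent as US-ASCII
    octets = _HIGH_ESCAPE.sub(                         # unescape %80 and above
        lambda m: bytes([int(m.group(1), 16)]), octets)
    out = []
    i = 0
    while i < len(octets):
        if octets[i] < 0x80:
            out.append(chr(octets[i]))
            i += 1
            continue
        ch, n = _decode_one(octets, i)
        if ch is not None and is_appropriate(ch):
            out.append(ch)                             # keep as a character
        else:
            # Not legal UTF-8, or not appropriate: re-escape the octets.
            out.append("".join("%{:02X}".format(b) for b in octets[i:i + n]))
        i += n
    return "".join(out)


iri = uri_to_iri("http://example.org/r%C3%A9sum%C3%A9")
```

An escape such as %E9 that is not followed by valid continuation octets stays escaped, which illustrates reason b. in the list of cases where escaping cannot be eliminated.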
| 3. Considerations for use of IRIs | 4. Bidirectional IRIs for Right-to-left Languages | |||
| 3.1 IRI Character Limitations | Some UCS characters, such as those used in the Arabic and Hebrew | |||
| script, have an inherent right-to-left writing direction. IRIs | ||||
| containing such characters (called bidirectional IRIs or Bidi IRIs) | ||||
| require additional attention because of the non-trivial relation | ||||
| between logical representation (used for digital representation as | ||||
| well as when reading/spelling) and visual representation (used for | ||||
| display/printing). | ||||
| Not all characters of the UCS are appropriate for use as resource | 4.1 Bidi IRI Structure | |||
| identifiers. This section discusses the limitations on characters | ||||
| and character sequences usable for IRIs. The considerations in this | ||||
| section are relevant when creating IRIs and when converting from URIs | ||||
| to IRIs. | ||||
| Because of the large and increasing number of characters in the UCS | IRIs have an inherent structure that distinguishes structural | |||
| and the large number of situations where IRIs can be used, it is | characters (usually punctuation such as '@', '.', ':', '/', and so | |||
| impossible to give general rules for which characters should be | on) called delimiters and payload components (usually consisting | |||
| avoided. The following considerations are relevant: | mostly of letters and digits). | |||
| ISSUE: Exact definition of components. | ||||
| In their internal digital representation, i.e. stored or transmitted | ||||
| for resolution, bidirectional IRIs MUST be in full logical order both | ||||
| for the overall structure as well as for the individual components. | ||||
| They MUST conform directly to the IRI syntax rules (which includes | ||||
| the rules relevant to their scheme). This is necessary to make sure | ||||
| that bidirectional IRIs can be processed in the same way as other | ||||
| IRIs. | ||||
| The components have the following restrictions: | ||||
| 1) A component MUST NOT use both right-to-left and left-to- | ||||
| right characters. | ||||
| 2) A component MUST NOT contain bidirectional formatting | ||||
| characters. | ||||
| 3) A component using right-to-left characters MUST NOT use any | ||||
| other class of characters (e.g. neutrals or numbers). | ||||
| Note: Restrictions 1) and 2) are not very severe, in that they do not | ||||
| overly restrict useful identifiers. Also, trying to remove them would | ||||
| make it impossible for humans to predict the logical sequence of | ||||
| characters inside a single component. On the other hand, it would be | ||||
| very desirable to remove or at least soften restriction 3). | ||||
| Otherwise, it is impossible to combine Arabic or Hebrew letters with | ||||
| numbers, or to use a hyphen between two subcomponents of an Arabic | ||||
| component to avoid the cursive connection of the two subcomponents. | ||||
| To a certain extent, softening this restriction should be easily | ||||
| possible by adding additional formatting characters in well defined | ||||
| ways similar to the provisions in Section 4.2. Feedback on this | ||||
| issue is particularly welcome. | ||||
| 4.2 Visual Rendering of Bidi IRIs | ||||
| Bidirectional IRIs MUST be rendered visually by rendering each | ||||
| component and each structural character from left to right. They | ||||
| MUST render each component according to its natural direction (i.e. | ||||
| left-to-right for components with left-to-right characters, right-to- | ||||
| left for components with right-to-left characters). | ||||
| ISSUE: The alternative is to display a series of right-to-left | ||||
| components in their natural (right-to-left) order. This has the | ||||
| advantage that it will often be easier for native readers to read the | ||||
| components in the right order. The restrictions on individual | ||||
| components change. In some cases, the correct visual rendering is | ||||
| automatic (i.e. exactly the same as with the Unicode algorithm), and | ||||
| so in these cases, no bidi formatting characters have to be added. | ||||
| In a textual context, i.e. assuming rendering by the Unicode | ||||
| bidirectional algorithm, the visual rendering backing store is done | ||||
| as follows: | ||||
| The visual representation uses some of the following Bidi formatting | ||||
| characters described by using an XML-style entity notation: | ||||
| &lrm; U+200E LEFT-TO-RIGHT MARK | ||||
| &rlm; U+200F RIGHT-TO-LEFT MARK | ||||
| &lre; U+202A LEFT-TO-RIGHT EMBEDDING | ||||
| &rle; U+202B RIGHT-TO-LEFT EMBEDDING | ||||
| &pdf; U+202C POP DIRECTIONAL FORMATTING | ||||
| &lro; U+202D LEFT-TO-RIGHT OVERRIDE | ||||
| &rlo; U+202E RIGHT-TO-LEFT OVERRIDE | ||||
| Each component with right-to-left characters is preceded and | ||||
| followed by an &lrm;. This left-to-right mark provides a left- | ||||
| to-right context to intervening syntactic characters. | ||||
| If the overall context (base directionality) is right-to-left, | ||||
| the identifier is preceded by an &lre; and followed by a &pdf;. | ||||
| This makes sure that the components of the identifier are | ||||
| rendered in left-to-right order. This may also be done by | ||||
| using the equivalent features of a higher-order protocol (e.g. | ||||
| by using the dir='ltr' attribute in HTML). | ||||
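The construction of the display form described above could look roughly like this. The sketch rests on assumptions the draft leaves open: the exact definition of components is flagged as an ISSUE in Section 4.1, so the delimiter set `DELIMITERS` here is only an illustrative guess, and `to_display_form` and `has_rtl` are hypothetical names. A component counts as right-to-left if it contains any character of Unicode bidirectional class R or AL.

```python
import re
import unicodedata

LRM = "\u200e"  # LEFT-TO-RIGHT MARK
LRE = "\u202a"  # LEFT-TO-RIGHT EMBEDDING
PDF = "\u202c"  # POP DIRECTIONAL FORMATTING

# Assumption: an illustrative set of structural characters (delimiters).
DELIMITERS = "@.:/?#"


def has_rtl(component):
    """True if the component contains a right-to-left character."""
    return any(unicodedata.bidirectional(ch) in ("R", "AL") for ch in component)


def to_display_form(iri, base_rtl=False):
    """Add Bidi formatting characters to a logical-order IRI for display."""
    # Split on delimiters, keeping them in the result list.
    parts = re.split("([" + re.escape(DELIMITERS) + "])", iri)
    # Surround each right-to-left component with left-to-right marks.
    body = "".join(LRM + p + LRM if has_rtl(p) else p for p in parts)
    # In a right-to-left context, embed the whole identifier left-to-right.
    return LRE + body + PDF if base_rtl else body


shown = to_display_form("http://\u05d0\u05d1.example/")
```

The formatting characters belong only to the display form; the stored or transmitted IRI stays in logical order without them, in line with restriction 2) of Section 4.1.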
| 4.3 Input of Bidi IRIs | ||||
| Bidi input methods MUST generate Bidi IRIs in logical order while | ||||
| rendering them according to Section 4.2. During input, rendering | ||||
| should be updated after every new character that is input to avoid | ||||
| end user confusion. | ||||
| 5. Use of IRIs | ||||
| 5.1 Limitations on UCS Character Allowed in IRI | ||||
| This section discusses the limitations on characters and character | ||||
| sequences usable for IRIs. The considerations in this section are | ||||
| relevant when creating IRIs and when converting from URIs to IRIs. | ||||
| a. The repertoire of characters allowed in each IRI component is | a. The repertoire of characters allowed in each IRI component is | |||
| limited by the definition of that component. For example, the | limited by the definition of that component. For example, the | |||
| definition of host names in URIs does not currently allow hex | definition of the scheme component does not allow characters | |||
| escapes, or "_", or many other punctuation characters. This | beyond US-ASCII. | |||
| specification does not relax those limits, and so IRIs | ||||
| currently may not contain any non-ASCII characters in host | ||||
| names. This specification likewise does not extend the | ||||
| scheme component beyond US-ASCII. | ||||
| (Note: In accordance with URI practice, generic IRI software | (Note: In accordance with URI practice, generic IRI software | |||
| cannot and should not check for such limitations.) | cannot and should not check for such limitations.) | |||
| b. In the URI syntax, characters that are likely to be used to | b. In the URI syntax, characters that are likely to be used to | |||
| delimit URIs in text and print ("space", "delims", and | delimit URIs in text and print ("space", "delims", and | |||
| "unwise") were excluded. They are included in the IRI syntax, | "unwise") were excluded. They are included in the IRI syntax, | |||
| for the following reasons: | for the following reasons: | |||
| 1) The syntax includes many other characters that are not | 1) The syntax includes many other characters that are not | |||
| skipping to change at page 12, line 34 | skipping to change at page 16, line 35 | |||
| 3) It is very convenient in some cases, for example for | 3) It is very convenient in some cases, for example for | |||
| XPointers in XML attributes. | XPointers in XML attributes. | |||
| 4) Considering context is already necessary in the case of | 4) Considering context is already necessary in the case of | |||
| URIs, for example for "&" in XML. | URIs, for example for "&" in XML. | |||
| However, these characters should be used carefully. Whenever | However, these characters should be used carefully. Whenever | |||
| there is a chance that an IRI will be used in a component where | there is a chance that an IRI will be used in a component where | |||
| these characters can be harmful, they should be escaped. | these characters can be harmful, they should be escaped. | |||
| c. The UCS contains many areas of "characters" which have no | c. The UCS contains many areas of characters for which there are | |||
| well-established way of inputting them. These should be | ||||
| avoided. Characters that fall into this category include | ||||
| Dingbats, Mathematical and other symbols, ligatures and | ||||
| presentation forms. | ||||
| d. The UCS contains many areas of characters for which there are | ||||
| strong visual look-alikes. Because of the likelihood of | strong visual look-alikes. Because of the likelihood of | |||
| transcription errors, these also should be avoided. This | transcription errors, these also should be avoided. This | |||
| includes the full-width equivalents of ASCII characters, half- | includes the full-width equivalents of ASCII characters, half- | |||
| width Katakana characters for Japanese, and many others. This | width Katakana characters for Japanese, and many others. This | |||
| also includes many look-alikes of "space", "delims", and | also includes many look-alikes of "space", "delims", and | |||
| "unwise", characters excluded in [RFC2396]. | "unwise", characters excluded in [RFC2396]. | |||
| e. Characters with no visual representation may not be | Additional information is available from [UNIXML]. Although [UNIXML] | |||
| interoperably entered. Control characters MUST NOT be used. | is written in a different context, it discusses many of the | |||
| This includes the traditional ranges of control characters | categories of characters and code points not appropriate for IRIs. | |||
| (U+0000-U+001F and U+007F-U+009F) as well as other cases such | ||||
| as plane-14 language tag characters. | ||||
| f. Some code points are reserved for private use or for special | 5.2 Software Interfaces and Protocols | |||
| encoding purposes. They are not interoperable. Code points | ||||
| reserved for private use MUST NOT be used. Code points | ||||
| reserved for surrogates MUST NOT be used. | ||||
| g. Where there exist duplicate ways of encoding a certain | Although an IRI is defined as a sequence of characters, software | |||
| character as visible to the user, Normalization Form C as | interfaces for URIs typically function on sequences of octets. Thus, | |||
| defined in [UNI15] MUST be used. | software interfaces and protocols MUST define which character | |||
| encoding is used. | ||||
| Additional information is available from [UNIXML]. Although this is | Intermediate software interfaces between IRI-capable components and | |||
| written in a different context, it discusses many of the categories | URI-only components MUST map the IRIs as per Section 3.1, when | |||
| of characters and code points not appropriate for IRIs. | transferring from IRI-capable to URI-only components. Such a mapping | |||
| SHOULD be applied as late as possible. It should not be applied | ||||
| between components that are known to be able to handle IRIs. | ||||
| For reasons of transcribability, many characters have been excluded | 5.3 Format of URIs and IRIs in Documents and Protocols | |||
| from IRIs above. These can nevertheless be encoded in an IRI if | ||||
| necessary. They have to be escaped using the procedure in Section | ||||
| 2.3. For example, a space can always be encoded in a URI and in an | ||||
| IRI as %20. A non-breaking space (U+00A0) must be encoded as %C2%A0. | ||||
| 3.2 Bidirectional IRIs for right-to-left languages | Document formats that transport URIs may need to be upgraded to allow | |||
| the transport of IRIs. In those cases where the document as a whole | ||||
| has a native character encoding, IRIs MUST also be encoded in this | ||||
| encoding, and converted accordingly by a parser or interpreter. IRI | ||||
| characters that are not expressible in the native encoding SHOULD be | ||||
| escaped using the escaping conventions of the document format if such | ||||
| conventions are available. Alternatively, they MAY be escaped | ||||
| according to Section 3.1. For example, in HTML, XML, or SGML, | ||||
| numeric character references should be used. If a document as a | ||||
| whole has a native character encoding, and that character encoding is | ||||
| not UTF-8, then IRIs MUST NOT be placed into the document in the UTF- | ||||
| 8 character encoding. | ||||
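As a small illustration of the escaping rule above for HTML, XML, or SGML, characters that the document encoding cannot express may be written as numeric character references. The sketch uses Python's standard "xmlcharrefreplace" error handler; `embed_iri` is a hypothetical helper name, and the sketch does not handle markup-significant characters such as "&" that the document format may also require to be escaped.

```python
def embed_iri(iri, document_encoding):
    """Encode an IRI in the document's native encoding.

    Characters not expressible in that encoding become decimal numeric
    character references (&#NNNN;).
    """
    return iri.encode(document_encoding, "xmlcharrefreplace")


embedded = embed_iri("http://example.org/r\u00e9/\u65e5\u672c", "iso-8859-1")
```

In this example the e-acute is expressible in ISO-8859-1 and is stored as a single octet, while the two Japanese characters are replaced by character references.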
| Some UCS characters, such as those used in the Arabic and Hebrew | Note: Some formats already accommodate IRIs, although they use | |||
| script, have an inherent right-to-left writing direction. IRIs | different terminology. HTML 4.0 [HTML4] defines the conversion from | |||
| containing such characters (called bidirectional IRIs or Bidi IRIs) | IRIs to URIs as error-avoiding behavior. XML 1.0 [XML1], XLink | |||
| require additional attention because of the non-trivial relation | [XLink], and XML Schema [XMLSchema] and specifications based upon | |||
| between logical representation (used for digital representation as | them allow IRIs. Also, it is expected that all relevant new W3C | |||
| well as when reading/spelling) and visual representation (used for | formats and protocols will be required to handle IRIs [CharMod]. | |||
| display/printing). This document does not address bidi-specific | ||||
| issues. A proposal for addressing these issues can be found in | ||||
| [Bidi]. | ||||
| 3.3 Processing IRIs | 5.4 Relative IRI References | |||
| Processing of relative forms of IRIs against a base is handled | Processing of relative forms of IRIs against a base is handled | |||
| straightforwardly; the algorithms of RFC 2396 may be applied | straightforwardly; the algorithms of RFC 2396 may be applied | |||
| directly, treating the characters additionally allowed in IRIs in the | directly, treating the characters additionally allowed in IRIs in the | |||
| same way as unreserved characters in URIs. Other processing | same way as unreserved characters in URIs. | |||
| operations on IRIs and IRI references similarly work analogous to | ||||
| their URI complements. | ||||
| Such processing and mapping to URIs is commutative, which means that | ||||
| the same result is obtained independent of whether the processing or | ||||
| the mapping is done first. If both IRIs and URIs are involved in | ||||
| processing, the IRI parts SHOULD be preserved as long as possible. | ||||
| For example, it is possible to create an absolute IRI from a relative | ||||
| IRI and a URI base. When IRIs are compared, they SHOULD temporarily | ||||
| be mapped to URIs to eliminate potential differences in the degree of | ||||
| escaping. | ||||
| 4. Software requirements | 6. URI/IRI Processing Guidelines (informative) | |||
| This section explains the issues and difficulties in supporting IRIs | This informative section provides guidelines for supporting IRIs in | |||
| in the same software components and operations that currently process | the same software components and operations that currently process | |||
| URIs: software interfaces that handle URIs, software that allows | URIs: software interfaces that handle URIs, software that allows | |||
| users to enter URIs, software that generates URIs, software that | users to enter URIs, software that generates URIs, software that | |||
| displays URIs, formats and protocols that transport URIs, and | displays URIs, formats and protocols that transport URIs, and | |||
| software that interprets URIs. These may all require more or less | software that interprets URIs. These may all require more or less | |||
| modification before functioning properly with IRIs. The | modification before functioning properly with IRIs. The | |||
| considerations in this section also apply to URI references and IRI | considerations in this section also apply to URI references and IRI | |||
| references. | references. | |||
| 4.1 URI/IRI software interfaces | 6.1 URI/IRI Software Interfaces | |||
| Software interfaces that handle URIs, such as URI-handling APIs and | Software interfaces that handle URIs, such as URI-handling APIs and | |||
| protocols transferring URIs, need interfaces and protocol elements | protocols transferring URIs, need interfaces and protocol elements | |||
| that are designed to carry IRIs. | that are designed to carry IRIs. | |||
| Note that although an IRI is defined as a sequence of characters, | ||||
| software interfaces for URIs typically function on sequences of | ||||
| octets. Thus, it is necessary to define clearly which character | ||||
| encoding is used. | ||||
| In case the current handling in an API or protocol is based on US- | In case the current handling in an API or protocol is based on US- | |||
| ASCII, UTF-8 is recommended as the encoding for IRIs, because this is | ASCII, UTF-8 is recommended as the encoding for IRIs, because this is | |||
| compatible with US-ASCII, is in accordance with the recommendations | compatible with US-ASCII, is in accordance with the recommendations | |||
| of [RFC2277], and makes it easy to convert to URIs where necessary. | of [RFC2277], and makes it easy to convert to URIs where necessary. | |||
| In any case, the encoding used must not be left undefined. | In any case, the encoding used must not be left undefined. | |||
| Intermediate software interfaces between IRI-capable components and | ||||
| URI-only components MUST map the IRIs as per section 2.3 above, when | ||||
| transferring from IRI-capable to URI-only components. However, such | ||||
| a mapping SHOULD be applied as late as possible. It should not be | ||||
| applied between components that are known to be able to handle IRIs. | ||||
| The transfer from URI-only to IRI-capable components requires no | The transfer from URI-only to IRI-capable components requires no | |||
| mapping, although the conversion described in section 2.4 above may | mapping, although the conversion described in Section 3.2 above may | |||
| be performed. It is preferable not to perform this inverse | be performed. It is preferable not to perform this inverse | |||
| conversion when there is a chance that this cannot be done correctly. | conversion when there is a chance that this cannot be done correctly. | |||
| 4.2 URI/IRI entry | 6.2 URI/IRI Entry | |||
| There are components that allow users to enter URIs into the system, | There are components that allow users to enter URIs into the system, | |||
| for example, by typing or dictation. This software must be updated | for example, by typing or dictation. This software must be updated | |||
| to allow for IRI entry. | to allow for IRI entry. | |||
| A person viewing a visual representation of an IRI (as a sequence of | A person viewing a visual representation of an IRI (as a sequence of | |||
| glyphs, in some order, in some visual display) or hearing an IRI, | glyphs, in some order, in some visual display) or hearing an IRI, | |||
| will use an entry method for characters in the user's language to | will use an entry method for characters in the user's language to | |||
| input the IRI. Depending on the script and the input method used, | input the IRI. Depending on the script and the input method used, | |||
| this may be a more or less complicated process. | this may be a more or less complicated process. | |||
| The process of IRI entry must assure, as far as possible, that the | The process of IRI entry must assure, as far as possible, that the | |||
| limitations defined in Section 3.1 are met. This may be done by | restrictions defined in Section 2.2 are met. This may be done by | |||
| choosing appropriate input methods or variants/settings thereof, by | choosing appropriate input methods or variants/settings thereof, by | |||
| appropriately converting the characters being input, by eliminating | appropriately converting the characters being input, by eliminating | |||
| characters that cannot be converted, and/or by issuing a warning or | characters that cannot be converted, and/or by issuing a warning or | |||
| error message to the user. | error message to the user. | |||
| An input field primarily or only used for the input of URIs/IRIs | An input field primarily or only used for the input of URIs/IRIs | |||
| should allow the user to view an IRI as converted to a URI. Places | should allow the user to view an IRI as converted to a URI. Places | |||
| where the input of IRIs is frequent should provide the possibility | where the input of IRIs is frequent should provide the possibility | |||
| for viewing an IRI as converted to a URI. This will help users when | for viewing an IRI as converted to a URI. This will help users when | |||
| some of the software they use does not yet accept IRIs. | some of the software they use does not yet accept IRIs. | |||
| An IRI input component that interfaces to components that handle | An IRI input component that interfaces to components that handle | |||
| URIs, but not IRIs, must escape the IRI before passing it to such a | URIs, but not IRIs, must escape the IRI before passing it to such a | |||
| component. | component. | |||
| For the input of IRIs with right-to-left characters, please see | For the input of IRIs with right-to-left characters, please see | |||
| [Bidi]. | Section 4. | |||
| 4.3 URI/IRI generation | 6.3 URI/IRI Generation | |||
| Systems that are offering resources through the Internet, where those | Systems that are offering resources through the Internet, where those | |||
| resources have logical names, sometimes automatically generate URIs | resources have logical names, sometimes automatically generate URIs | |||
| for the resources they offer. For example, some HTTP servers can | for the resources they offer. For example, some HTTP servers can | |||
| generate a directory listing for a file directory, and then respond | generate a directory listing for a file directory, and then respond | |||
| to the generated URIs with the files. | to the generated URIs with the files. | |||
| Many legacy character encodings are in use in various file systems. | Many legacy character encodings are in use in various file systems. | |||
| Many currently deployed systems do not transform the local character | Many currently deployed systems do not transform the local character | |||
| representation of the underlying system before generating URIs. | representation of the underlying system before generating URIs. | |||
| skipping to change at page 16, line 6 | skipping to change at page 19, line 29 | |||
| use IRIs converted to URIs in cases where it cannot be expected that | use IRIs converted to URIs in cases where it cannot be expected that | |||
| the recipient is able to handle IRIs. Due to the way most user | the recipient is able to handle IRIs. Due to the way most user | |||
| agents currently work, native IRIs, encoded in UTF-8, may be used if | agents currently work, native IRIs, encoded in UTF-8, may be used if | |||
| the recipient announces that it can interpret UTF-8. This requires | the recipient announces that it can interpret UTF-8. This requires | |||
| that the whole page is sent as UTF-8. If this is not possible, | that the whole page is sent as UTF-8. If this is not possible, | |||
| escaping can always be used. | escaping can always be used. | |||
| This recommendation in particular applies to HTTP servers. For FTP | This recommendation in particular applies to HTTP servers. For FTP | |||
| servers, similar considerations apply, see in particular [RFC2640]. | servers, similar considerations apply, see in particular [RFC2640]. | |||
| 4.4 URI/IRI selection | 6.4 URI/IRI Selection | |||
| In some cases, resource owners and publishers have control over the | In some cases, resource owners and publishers have control over the | |||
| IRIs used to identify their resources. Such control is mostly | IRIs used to identify their resources. Such control is mostly | |||
| executed by controlling the resource names, such as file names, | executed by controlling the resource names, such as file names, | |||
| directly. | directly. | |||
| In such cases, it is recommended to avoid choosing IRIs that are | In such cases, it is recommended to avoid choosing IRIs that are | |||
| easily confused. For example, for US-ASCII, the lower-case ell "l" | easily confused. For example, for US-ASCII, the lower-case ell "l" | |||
| is easily confused with the digit one "1", and the upper-case oh "O" | is easily confused with the digit one "1", and the upper-case oh "O" | |||
| is easily confused with the digit zero "0". Publishers should avoid | is easily confused with the digit zero "0". Publishers should avoid | |||
| skipping to change at page 16, line 31 | skipping to change at page 20, line 5 | |||
| here. As long as names are limited to characters from a single | here. As long as names are limited to characters from a single | |||
| script, native writers of a given script or language will know best | script, native writers of a given script or language will know best | |||
| when ambiguities can appear, and how they can be avoided. What may | when ambiguities can appear, and how they can be avoided. What may | |||
| look ambiguous to a stranger may be completely obvious to the average | look ambiguous to a stranger may be completely obvious to the average | |||
| native user. On the other hand, in some cases, the UCS contains | native user. On the other hand, in some cases, the UCS contains | |||
| variants for compatibility reasons, for example for typographic | variants for compatibility reasons, for example for typographic | |||
| purposes. These should be avoided wherever possible. Although there | purposes. These should be avoided wherever possible. Although there | |||
| may be exceptions, in general newly created resource names should be | may be exceptions, in general newly created resource names should be | |||
| in NFKC [UNI15] (which means that they are also in NFC). | in NFKC [UNI15] (which means that they are also in NFC). | |||
| Note that the limitations defined in Section 3.1 and the | ||||
| recommendations given here are of a different nature. The | ||||
| limitations defined in Section 3.1 are necessary to avoid duplicate | ||||
| encodings that are artifacts of digital representation and that the | ||||
| user has no way to distinguish visually. On the other hand, in a | ||||
| given context, an identifier such as "BOX0021" can be completely | ||||
| appropriate, and it is impossible to find an algorithm that | ||||
| distinguishes the appropriate from the confusing identifiers. | ||||
| In certain cases, there is a chance that letters from different | In certain cases, there is a chance that letters from different | |||
| scripts look the same. The best known example is the Latin 'A', the | scripts look the same. The best known example is the Latin 'A', the | |||
| Greek 'Alpha', and the Cyrillic 'A'. To avoid such cases, only IRIs | Greek 'Alpha', and the Cyrillic 'A'. To avoid such cases, only IRIs | |||
| should be generated where all the letters in a single component are | should be generated where all the letters in a single component are | |||
| from the same script. This is similar to the heuristics used to | from the same script. This is similar to the heuristics used to | |||
| distinguish between letters and numbers in the examples above. Also, | distinguish between letters and numbers in the examples above. Also, | |||
| for the above three scripts, using lower-case letters results in | for the above three scripts, using lower-case letters results in | |||
| fewer ambiguities than using upper-case letters. | fewer ambiguities than using upper-case letters. | |||
| 4.5 Display of URIs/IRIs | 6.5 Display of URIs/IRIs | |||
| Many systems contain software that presents URIs to users as part of | ||||
| the system's user interface (sometimes presenting 'friendly' URIs, | ||||
| such as a shortened or more legible substring of the URI). This | ||||
| section applies to this presentation, as well as to the strategy for | ||||
| printing URIs in magazines, newspapers, or reading them over the | ||||
| radio. | ||||
| Software that displays identifiers to users should follow a general | ||||
| principle: "Don't display something to a user that the user would not | ||||
| be able to enter." The consequences of this principle require | ||||
| judgement about the availability of software that implements the | ||||
| entry methods described in Section 3.2. | ||||
| a) In situations where a viewer is not likely to have software | ||||
| that implements non-ASCII character entry (as described in | ||||
| Section 3.1), or where it can be expected that only a limited | ||||
| range of non-ASCII characters can be entered, any part of an | ||||
| IRI containing characters outside the range allowed in | ||||
| [RFC2396] or any additions SHOULD be escaped before being | ||||
| displayed. | ||||
| b) In situations where a viewer _is_ likely to have such software, | In situations where the rendering software is not expected to display | |||
| IRIs SHOULD be displayed directly. | non-ASCII parts of the IRI correctly using the available layout and | |||
| font resources, these parts should be escaped before being displayed. | ||||
| For display of Bidi IRIs, please see [Bidi]. | For display of Bidi IRIs, please see Section 4.2. | |||
| 4.6 Interpretation of URI/IRIs | 6.6 Interpretation of URIs and IRIs | |||
| Software that interprets IRIs as the names of local resources should | Software that interprets IRIs as the names of local resources should | |||
| accept IRIs in multiple forms, and convert and match them with the | accept IRIs in multiple forms, and convert and match them with the | |||
| appropriate local resource names. | appropriate local resource names. | |||
| First, multiple representations include both IRIs in the native | First, multiple representations include both IRIs in the native | |||
| character encoding of the protocol and also their URI counterparts. | character encoding of the protocol and also their URI counterparts. | |||
| Second, it may include URIs constructed based on other character | Second, it may include URIs constructed based on other character | |||
| encodings than UTF-8. Such URIs may be produced by user agents that | encodings than UTF-8. Such URIs may be produced by user agents that | |||
| skipping to change at page 18, line 15 | skipping to change at page 21, line 7 | |||
| the accents on received IRIs or resource names where appropriate. | the accents on received IRIs or resource names where appropriate. | |||
| Please note that such mappings, including case mappings, are | Please note that such mappings, including case mappings, are | |||
| language-dependent. | language-dependent. | |||
| It can be difficult to unambiguously identify a resource if too many | It can be difficult to unambiguously identify a resource if too many | |||
| mappings are taken into consideration. However, escaped and non- | mappings are taken into consideration. However, escaped and non- | |||
| escaped parts of IRIs can always clearly be distinguished. Also, the | escaped parts of IRIs can always clearly be distinguished. Also, the | |||
| regularity of UTF-8 (see [Duer97]) makes the potential for collisions | regularity of UTF-8 (see [Duer97]) makes the potential for collisions | |||
| lower than it may seem at first sight. | lower than it may seem at first sight. | |||
| 4.7 Transportation of URI/IRIs in document formats and protocols | 6.7 Upgrading Strategy | |||
| Document formats that transport URIs may need to be upgraded to allow | ||||
| the transport of IRIs. In those cases where the document as a whole | ||||
| has a native character encoding, IRIs SHOULD also be encoded in this | ||||
| encoding, and converted accordingly by a parser or interpreter. IRI | ||||
| characters that are not expressible in the native encoding SHOULD be | ||||
| escaped according to Section 2.2, or MAY be escaped in another way if | ||||
| the document format provides a way to do this. For example, in HTML, | ||||
| XML, or SGML, numeric character references can be used. If a | ||||
| document as a whole has a native character encoding, and that | ||||
| character encoding is not UTF-8, then IRIs MUST NOT be placed into | ||||
| the document in the UTF-8 character encoding. | ||||
| Please note that some formats already accommodate IRIs, although they | ||||
| use different terminology. HTML 4.0 [HTML4] defines the conversion | ||||
| from IRIs to URIs as error-avoiding behavior. XML 1.0 [XML1], XLink | ||||
| [XLink], and XML Schema [XMLSchema] and specifications based upon | ||||
| them allow IRIs. Also, it is expected that all relevant new W3C | ||||
| formats and protocols will be required to handle IRIs [CharMod]. | ||||
| 5. Upgrading strategy | ||||
| As this recommendation places further constraints on software for | As this recommendation places further constraints on software for | |||
| which many instances are already deployed, it is important to | which many instances are already deployed, it is important to | |||
| introduce upgrades carefully, and to be aware of the various | introduce upgrades carefully, and to be aware of the various | |||
| interdependencies. | interdependencies. | |||
| If IRIs cannot be interpreted correctly, they should not be generated | If IRIs cannot be interpreted correctly, they should not be generated | |||
| or transported. This suggests that upgrading URI interpreting | or transported. This suggests that upgrading URI interpreting | |||
| software to accept IRIs should have highest priority. | software to accept IRIs should have highest priority. | |||
| skipping to change at page 19, line 23 | skipping to change at page 21, line 42 | |||
| is known to transport them safely. | is known to transport them safely. | |||
| Display software should be upgraded only after upgraded entry | Display software should be upgraded only after upgraded entry | |||
| software has been widely deployed to the population that will see the | software has been widely deployed to the population that will see the | |||
| displayed result. | displayed result. | |||
| These recommendations, when taken together, will allow for the | These recommendations, when taken together, will allow for the | |||
| extension from URIs to IRIs in order to handle scripts other than | extension from URIs to IRIs in order to handle scripts other than | |||
| ASCII while minimizing interoperability problems. | ASCII while minimizing interoperability problems. | |||
| 6. Security considerations | 7. Security Considerations | |||
| If IRI entry software normalizes the characters entered, but the | Incorrect escaping or unescaping can lead to security problems. In | |||
| resource names on the interpreting side are not normalized | particular, some UTF-8 decoders do not check against overlong byte | |||
| accordingly, and the interpreting software does not take this into | sequences. As an example, a '/' is encoded with the byte 0x2F both | |||
| account, there is a possibility of "spoofing". Similar possibilities | in UTF-8 and in ASCII, but some UTF-8 decoders also wrongly interpret | |||
| turn up when interpreting software accepts URIs in various native | the sequence 0xC0 0xAF as a '/'. A sequence such as '%C0%AF..' may | |||
| encodings or allows accents and similar things to be ignored. | pass some security tests and then be interpreted as '/..' in a path | |||
| if UTF-8 decoders are fault-tolerant, if conversion and checking are | ||||
| not done in the right order, and/or if reserved characters and | ||||
| unreserved characters are not clearly distinguished. | ||||
| There are various ways in which "spoofing" can occur with IRIs. | ||||
| "Spoofing" means that somebody may add a resource name that looks the | "Spoofing" means that somebody may add a resource name that looks the | |||
| same or similar to the user while actually being different, or a | same or similar to the user, but points to a different resource. The | |||
| resource name that contains the same characters, but in a different | added resource may pretend to be the real resource by looking very | |||
| encoding. The added resource may pretend to be the real resource by | similar, but may contain all kinds of changes that may be difficult | |||
| looking very similar, but may contain all kinds of changes that may | to spot but can cause all kinds of problems. Most spoofing | |||
| be difficult to spot but can cause all kinds of problems. | possibilities for IRIs are extensions of those for URIs. | |||
| Conceptually, this is no different from the problems surrounding the | Spoofing can occur for various reasons. A first reason is that | |||
| use of case-insensitive web servers. For example, a popular web page | normalization expectations of a user or actual normalization when | |||
| with a mixed case name (http://big.site/PopularPage.html) might be | entering an IRI do not match the normalization used on the server | |||
| "spoofed" by someone who obtains access to (http://big.site/ | side. Conceptually, this is no different from the problems | |||
| popularpage.html). | surrounding the use of case-insensitive web servers. For example, a | |||
| popular web page with a mixed case name (http://big.site/ | ||||
| PopularPage.html) might be "spoofed" by someone who obtains access to | ||||
| http://big.site/popularpage.html. However, the introduction of | ||||
| character normalization, and of additional mappings for user | ||||
| convenience, may increase the chance for spoofing. | ||||
| However, the introduction of character normalization, of additional | Spoofing can occur due to the fact that in the UCS, there are many | |||
| mappings for user convenience, and of mappings for various encodings | characters that look very similar. Details are discussed in Section | |||
| may increase the number of spoofing possibilities. In some cases, in | 6.4. Again, this is very similar to spoofing possibilities on US- | |||
| particular for Latin-based resource names, this is usually easy to | ASCII, e.g. using 'br0ken' or '1ame' URIs. | |||
| detect because UTF-8-encoded names, when interpreted and viewed as | ||||
| legacy encodings, produce mostly garbage. In other cases, when | ||||
| concurrently used encodings have a similar structure, but there are | ||||
| no characters that have exactly the same encoding, detection is more | ||||
| difficult. A good example may be the concurrent use of Shift_JIS and | ||||
| EUC-JP on a Japanese server. | ||||
| Administrators of large sites which allow independent users to create | Spoofing can occur when URIs in various encodings are accepted to | |||
| subareas may need to be careful that the aliasing rules do not create | deal with older user agents. In some cases, in particular for Latin- | |||
| chances for spoofing. | based resource names, this is usually easy to detect because UTF-8- | |||
| encoded names, when interpreted and viewed as legacy encodings, | ||||
| produce mostly garbage. In other cases, when concurrently used | ||||
| encodings have a similar structure, but there are no characters that | ||||
| have exactly the same encoding, detection is more difficult. | ||||
| 7. Acknowledgements | Spoofing can occur in various IRI components, such as the domain name | |||
| part or a path part. For considerations specific to the domain name | ||||
| part, see [Nameprep]. For the path part, administrators of sites | ||||
| which allow independent users to create resources in the same subarea | ||||
| may need to be careful to check for spoofing. | ||||
| We would like to thank Larry Masinter for his work as co-author of | 8. Change log | |||
| many earlier versions of this document. | ||||
| Changes from -00 to -01 | ||||
| - Re-integrated the section on Bidi, some issues left. | ||||
| - Integrated IDN, changed syntax (host, userinfo,....). | ||||
| - Moved some text around, marked some as informational. | ||||
| - Made a clear distinction of IRI use for identification only and | ||||
| for resource resolution. | ||||
| - Fixed various details in wording, spelling,... | ||||
| 9. Acknowledgements | ||||
| We would like to thank Larry Masinter for his work as coauthor of | ||||
| many earlier versions of this document (draft-masinter-url-i18n-xx). | ||||
| The issue addressed here has been discussed numerous times over | The issue addressed here has been discussed numerous times over | |||
| the last years; for example, there was a thread in the HTML working | the last years; for example, there was a thread in the HTML working | |||
| group in August 1995 (under the topic of "Globalizing URIs") in the | group in August 1995 (under the topic of "Globalizing URIs") in the | |||
| www-international mailing list in July 1996 (under the topic of | www-international mailing list in July 1996 (under the topic of | |||
| "Internationalization and URLs"), and ad-hoc meetings at the Unicode | "Internationalization and URLs"), and ad-hoc meetings at the Unicode | |||
| conferences in September 1995 and September 1997. | conferences in September 1995 and September 1997. | |||
| Thanks to Francois Yergeau, Chris Wendt, Yaron Goland, Graham Klyne, | Thanks to Francois Yergeau, Chris Wendt, Yaron Goland, Graham Klyne, | |||
| Roy Fielding, Tim Berners-Lee, M.T. Carrasco Benitez, James Clark, | Roy Fielding, Tim Berners-Lee, M.T. Carrasco Benitez, James Clark, | |||
| skipping to change at page 20, line 37 | skipping to change at page 23, line 37 | |||
| Bjoern Hoehrmann, Dan Oscarson, and many others for help with | Bjoern Hoehrmann, Dan Oscarson, and many others for help with | |||
| understanding the issues and possible solutions. Thanks also to the | understanding the issues and possible solutions. Thanks also to the | |||
| members of the W3C I18N Working Group and Interest Group for their | members of the W3C I18N Working Group and Interest Group for their | |||
| contributions and their work on [CharMod], to the members of many | contributions and their work on [CharMod], to the members of many | |||
| other W3C WGs for adopting the ideas, and to the members of the | other W3C WGs for adopting the ideas, and to the members of the | |||
| Montreal IAB Workshop on Internationalization and Localization for | Montreal IAB Workshop on Internationalization and Localization for | |||
| their review. | their review. | |||
| References | References | |||
| [Bidi] Duerst, M., "Internet Identifiers and Bidirectionality", | [CharMod] Duerst, M., Yergeau, F., Ishida, R., Wolf, M., | |||
| draft-duerst-iri-bidi-00 (work in progress), July 2001, | Freytag, A. and T. Texin, "Character Model for the | |||
| <http://www.ietf.org/internet-drafts/draft-duerst-iri- | World Wide Web", World Wide Web Consortium Working | |||
| bidi-00.txt>. | Draft, April 2002, <http://www.w3.org/TR/charmod>. | |||
| [CharMod] Duerst, M., Yergeau, F., Ishida, R., Wolf, M., Freytag, | ||||
| A. and T. Texin, "Character Model for the World Wide | ||||
| Web", World Wide Web Consortium Working Draft, February | ||||
| 2002, <http://www.w3.org/TR/charmod>. | ||||
| [Duer97] Duerst, M., "The Properties and Promizes of UTF-8", | [Duer97] Duerst, M., "The Properties and Promises of UTF-8", | |||
| Proc. 11th International Unicode Conference, San Jose , | Proc. 11th International Unicode Conference, San Jose | |||
| September 1997, <http://www.ifi.unizh.ch/mml/mduerst/ | , September 1997, <http://www.ifi.unizh.ch/mml/ | |||
| papers/PDF/IUC11-UTF-8.pdf>. | mduerst/papers/PDF/IUC11-UTF-8.pdf>. | |||
| [Duer01] Duerst, M., "Internationalized Resource Identifiers: | [Duer01] Duerst, M., "Internationalized Resource Identifiers: | |||
| From Specification to Testing", Proc. 19th International | From Specification to Testing", Proc. 19th | |||
| Unicode Conference, San Jose , September 2001, <http:// | International Unicode Conference, San Jose , | |||
| www.w3.org/2001/Talks/0912-IUC-IRI/paper.html>. | September 2001, <http://www.w3.org/2001/Talks/0912- | |||
| IUC-IRI/paper.html>. | ||||
| [HTML4] Raggett, D., Le Hors, A. and I. Jacobs, "HTML 4.01 | [HTML4] Raggett, D., Le Hors, A. and I. Jacobs, "HTML 4.01 | |||
| Specification", World Wide Web Consortium | Specification", World Wide Web Consortium | |||
| Recommendation, December 1999, <http://www.w3.org/TR/ | Recommendation, December 1999, <http://www.w3.org/TR/ | |||
| REC-html40/appendix/notes.html#h-B.2>. | REC-html40/appendix/notes.html#h-B.2>. | |||
| [IDN-URI] Duerst, M., "Internationalized Domain Names in URIs and | [IDNURI] Duerst, M., "Internationalized Domain Names in URIs", | |||
| IRIs", draft-ietf-idn-uri-01 (work in progress), | draft-ietf-idn-uri-02.txt (work in progress), July | |||
| November 2001, <http://www.ietf.org/internet-drafts/ | 2002, <http://www.ietf.org/internet-drafts/draft- | |||
| draft-ietf-idn-uri-01.txt>. | ietf-idn-uri-02.txt>. | |||
| [IDNA] Faltstrom, P., Hoffman, P. and A. Costello, | ||||
| "Internationalizing Domain Names in Applications | ||||
| (IDNA)", draft-ietf-idn-idna-09.txt (work in | ||||
| progress), May 2002, <http://www.ietf.org/internet- | ||||
| drafts/draft-ietf-idn-idna-09.txt>. | ||||
| [ISO10646] International Organization for Standardization, | [ISO10646] International Organization for Standardization, | |||
| "Information Technology - Universal Multiple-Octet Coded | "Information Technology - Universal Multiple-Octet | |||
| Character Set (UCS) - Part 1: Architecture and Basic | Coded Character Set (UCS) - Part 1: Architecture and | |||
| Multilingual Plane", ISO Standard 10646-1, with | Basic Multilingual Plane", ISO Standard 10646-1, with | |||
| amendments, October 2000. | amendments, October 2000. | |||
| [Nameprep] Hoffman, P. and M. Blanchet, "Nameprep: A Stringprep | ||||
| Profile for Internationalized Domain Names", draft- | ||||
| ietf-idn-nameprep-10.txt (work in progress), May | ||||
| 2002, <http://www.ietf.org/internet-drafts/draft- | ||||
| ietf-idn-nameprep-10.txt>. | ||||
| [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate | [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate | |||
| Requirement Levels", BCP 14, RFC 2119, March 1997. | Requirement Levels", BCP 14, RFC 2119, March 1997. | |||
| [RFC2130] Weider, C., Preston, C., Simonsen, K., Alvestrand, H., | [RFC2130] Weider, C., Preston, C., Simonsen, K., Alvestrand, | |||
| Atkinson, R., Crispin, M. and P. Svanberg, "The Report | H., Atkinson, R., Crispin, M. and P. Svanberg, "The | |||
| of the IAB Character Set Workshop held 29 February - 1 | Report of the IAB Character Set Workshop held 29 | |||
| March, 1996", RFC 2130, April 1997. | February - 1 March, 1996", RFC 2130, April 1997. | |||
| [RFC2141] Moats, R., "URN Syntax", RFC 2141, May 1997. | [RFC2141] Moats, R., "URN Syntax", RFC 2141, May 1997. | |||
| [RFC2192] Newman, C., "IMAP URL Scheme", RFC 2192, September 1997. | [RFC2192] Newman, C., "IMAP URL Scheme", RFC 2192, September | |||
| 1997. | ||||
| [RFC2277] Alvestrand, H., "IETF Policy on Character Sets and | [RFC2277] Alvestrand, H., "IETF Policy on Character Sets and | |||
| Languages", BCP 18, RFC 2277, January 1998. | Languages", BCP 18, RFC 2277, January 1998. | |||
| [RFC2279] Yergeau, F., "UTF-8, a transformation format of ISO | [RFC2279] Yergeau, F., "UTF-8, a transformation format of ISO | |||
| 10646", RFC 2279, January 1998. | 10646", RFC 2279, January 1998. | |||
| [RFC2384] Gellens, R., "POP URL Scheme", RFC 2384, August 1998. | [RFC2384] Gellens, R., "POP URL Scheme", RFC 2384, August 1998. | |||
| [RFC2396] Berners-Lee, T., Fielding, R. and L. Masinter, "Uniform | [RFC2396] Berners-Lee, T., Fielding, R. and L. Masinter, | |||
| Resource Identifiers (URI): Generic Syntax", RFC 2396, | "Uniform Resource Identifiers (URI): Generic Syntax", | |||
| August 1998. | RFC 2396, August 1998. | |||
| [RFC2397] Masinter, L., "The "data" URL scheme", RFC 2397, August | [RFC2397] Masinter, L., "The "data" URL scheme", RFC 2397, | |||
| 1998. | August 1998. | |||
| [RFC2616] Fielding, R., Gettys, J., Mogul, J., Nielsen, H., | [RFC2616] Fielding, R., Gettys, J., Mogul, J., Nielsen, H., | |||
| Masinter, L., Leach, P. and T. Berners-Lee, "Hypertext | Masinter, L., Leach, P. and T. Berners-Lee, | |||
| Transfer Protocol -- HTTP/1.1", RFC 2616, June 1999. | "Hypertext Transfer Protocol -- HTTP/1.1", RFC 2616, | |||
| June 1999. | ||||
| [RFC2640] Curtin, B., "Internationalization of the File Transfer | [RFC2640] Curtin, B., "Internationalization of the File | |||
| Protocol", RFC 2640, July 1999. | Transfer Protocol", RFC 2640, July 1999. | |||
| [RFC2718] Masinter, L., Alvestrand, H., Zigmond, D. and R. Petke, | [RFC2718] Masinter, L., Alvestrand, H., Zigmond, D. and R. | |||
| "Guidelines for new URL Schemes", RFC 2718, November | Petke, "Guidelines for new URL Schemes", RFC 2718, | |||
| 1999. | November 1999. | |||
| [RFC2732] Hinden, R., Carpenter, B. and L. Masinter, "Format for | [RFC2732] Hinden, R., Carpenter, B. and L. Masinter, "Format | |||
| Literal IPv6 Addresses in URL's", RFC 2732, December | for Literal IPv6 Addresses in URL's", RFC 2732, | |||
| 1999. | December 1999. | |||
| [UNIV3] The Unicode Consortium, "The Unicode Standard Version | [UNIV3] The Unicode Consortium, "The Unicode Standard Version | |||
| 3.0", Addison-Wesley, Reading, MA , 2000. | 3.0", Addison-Wesley, Reading, MA , 2000. | |||
| [UNI15] Davis, M. and M. Duerst, "Unicode Normalization Forms", | [UNI15] Davis, M. and M. Duerst, "Unicode Normalization | |||
| Unicode Standard Annex #15, March 2001, <http:// | Forms", Unicode Standard Annex #15, March 2001, | |||
| www.unicode.org/unicode/reports/tr15/tr15-21.html>. | <http://www.unicode.org/unicode/reports/tr15/tr15- | |||
| 21.html>. | ||||
| [UNIXML] Duerst, M. and A. Freytag, "Unicode in XML and other | [UNIXML] Duerst, M. and A. Freytag, "Unicode in XML and other | |||
| Markup Languages", Unicode Technical Report #20, World | Markup Languages", Unicode Technical Report #20, | |||
| Wide Web Consortium Note, February 2002, <http:// | World Wide Web Consortium Note, February 2002, <http:/ | |||
| www.w3.org/TR/unicode-xml/>. | /www.w3.org/TR/unicode-xml/>. | |||
| [W3CIRI] "Internationalization - URIs and other identifiers", | [W3CIRI] "Internationalization - URIs and other identifiers", | |||
| <http://www.w3.org/International/O-URL-and-ident.html>. | <http://www.w3.org/International/O-URL-and- | |||
| ident.html>. | ||||
| [XLink] DeRose, S., Maler, E. and D. Orchard, "XML Linking | [XLink] DeRose, S., Maler, E. and D. Orchard, "XML Linking | |||
| Language (XLink) Version 1.0", World Wide Web Consortium | Language (XLink) Version 1.0", World Wide Web | |||
| Recommendation, June 2001, <http://www.w3.org/TR/xlink/ | Consortium Recommendation, June 2001, <http:// | |||
| #link-locators>. | www.w3.org/TR/xlink/#link-locators>. | |||
| [XML1] Bray, T., Paoli, J., Sperberg-McQueen, C. and E. Maler, | [XML1] Bray, T., Paoli, J., Sperberg-McQueen, C. and E. | |||
| "Extensible Markup Language (XML) 1.0 (Second Edition)", | Maler, "Extensible Markup Language (XML) 1.0 (Second | |||
| World Wide Web Consortium Recommendation, including | Edition)", World Wide Web Consortium Recommendation, | |||
| Erratum 26 at http://www.w3.org/XML/xml-V10-2e- | including Erratum 26 at http://www.w3.org/XML/xml- | |||
| errata#E26, October 2000, <http://www.w3.org/TR/REC- | V10-2e-errata#E26, October 2000, <http://www.w3.org/ | |||
| xml#sec-external-ent>. | TR/REC-xml#sec-external-ent>. | |||
| [XMLNamespace] Bray, T., Hollander, D. and A. Layman, "Namespaces in | ||||
| XML", World Wide Web Consortium Recommendation, | ||||
| January 1999, <http://www.w3.org/TR/REC-xml#sec- | ||||
| external-ent>. | ||||
| [XMLSchema] Biron, P. and A. Malhotra, "XML Schema Part 2: | [XMLSchema] Biron, P. and A. Malhotra, "XML Schema Part 2: | |||
| Datatypes", World Wide Web Consortium Recommendation, | Datatypes", World Wide Web Consortium Recommendation, | |||
| May 2001, <http://www.w3.org/TR/xmlschema-2/#anyURI>. | May 2001, <http://www.w3.org/TR/xmlschema-2/#anyURI>. | |||
| Authors' Addresses | Authors' Addresses | |||
| Martin Duerst (Note: Please write "Duerst" with u-umlaut wherever | Martin Duerst (Note: Please write "Duerst" with u-umlaut wherever | |||
| possible, for example as "Dürst" in XML and HTML.) | possible, for example as "Dürst" in XML and HTML.) | |||
| W3C/Keio University | W3C/Keio University | |||
| End of changes. | ||||
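
The Security Considerations text added in -01 warns that a sequence such as '%C0%AF..' may pass a security test and later be read as '/..' when a fault-tolerant UTF-8 decoder accepts overlong byte sequences. The following is a minimal sketch of one way to avoid that, assuming the path is unescaped and strictly decoded before any check is made. The function names `decode_path_strictly` and `is_safe_path` are hypothetical and do not appear in either draft.

```python
from urllib.parse import unquote_to_bytes


def decode_path_strictly(escaped_path: str) -> str:
    """Unescape %HH sequences, then decode the resulting bytes as strict UTF-8."""
    raw = unquote_to_bytes(escaped_path)
    # Python's built-in UTF-8 codec rejects overlong forms such as C0 AF,
    # so an invalid sequence raises UnicodeDecodeError instead of becoming '/'.
    return raw.decode("utf-8", errors="strict")


def is_safe_path(escaped_path: str) -> bool:
    """Hypothetical check: decode first, then look for '..' path segments."""
    try:
        decoded = decode_path_strictly(escaped_path)
    except UnicodeDecodeError:
        # Overlong or otherwise malformed UTF-8: refuse rather than guess.
        return False
    return ".." not in decoded.split("/")
```

The order matters in this sketch: the check runs on the decoded string, so an escaped '..' (as in '/a/%2E%2E/b') is caught as well, and nothing is checked in one form and interpreted in another.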
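
The section on IRI entry says that an input component handing an IRI to a URI-only component must escape it first, and that input fields should let the user view the IRI as converted to a URI. A rough sketch of that conversion follows, assuming the simple case where each non-ASCII character is encoded as UTF-8 and each resulting byte is written as %HH. The name `iri_to_uri` is hypothetical. The sketch does not apply normalization, and it does not treat the domain name part separately, which the -01 draft handles through IDN.

```python
def iri_to_uri(iri: str) -> str:
    """Escape every non-ASCII character as percent-encoded UTF-8 bytes."""
    out = []
    for ch in iri:
        if ord(ch) < 0x80:
            # ASCII characters, including existing %HH escapes, pass through.
            out.append(ch)
        else:
            out.extend(f"%{byte:02X}" for byte in ch.encode("utf-8"))
    return "".join(out)
```

Because ASCII input is returned unchanged, applying the function to a string that is already a URI has no effect, which is convenient when a field may contain either form.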