| draft-duerst-iri-05.txt | draft-duerst-iri-06.txt | |||
|---|---|---|---|---|
| Network Working Group M. Duerst | Network Working Group M. Duerst | |||
| Internet-Draft W3C | Internet-Draft W3C | |||
| Expires: April 25, 2004 M. Suignard | Expires: August 15, 2004 M. Suignard | |||
| Microsoft Corporation | Microsoft Corporation | |||
| October 26, 2003 | February 15, 2004 | |||
| Internationalized Resource Identifiers (IRIs) | Internationalized Resource Identifiers (IRIs) | |||
| draft-duerst-iri-05 | draft-duerst-iri-06 | |||
| Status of this Memo | Status of this Memo | |||
| This document is an Internet-Draft and is in full conformance with | This document is an Internet-Draft and is in full conformance with | |||
| all provisions of Section 10 of RFC2026. | all provisions of Section 10 of RFC2026. | |||
| Internet-Drafts are working documents of the Internet Engineering | Internet-Drafts are working documents of the Internet Engineering | |||
| Task Force (IETF), its areas, and its working groups. Note that | Task Force (IETF), its areas, and its working groups. Note that | |||
| other groups may also distribute working documents as Internet- | other groups may also distribute working documents as Internet- | |||
| Drafts. | Drafts. | |||
| skipping to change at page 1, line 33 | skipping to change at page 1, line 33 | |||
| and may be updated, replaced, or obsoleted by other documents at any | and may be updated, replaced, or obsoleted by other documents at any | |||
| time. It is inappropriate to use Internet-Drafts as reference | time. It is inappropriate to use Internet-Drafts as reference | |||
| material or to cite them other than as "work in progress." | material or to cite them other than as "work in progress." | |||
| The list of current Internet-Drafts can be accessed at http:// | The list of current Internet-Drafts can be accessed at http:// | |||
| www.ietf.org/ietf/1id-abstracts.txt. | www.ietf.org/ietf/1id-abstracts.txt. | |||
| The list of Internet-Draft Shadow Directories can be accessed at | The list of Internet-Draft Shadow Directories can be accessed at | |||
| http://www.ietf.org/shadow.html. | http://www.ietf.org/shadow.html. | |||
| This Internet-Draft will expire on April 25, 2004. | This Internet-Draft will expire on August 15, 2004. | |||
| Copyright Notice | Copyright Notice | |||
| Copyright (C) The Internet Society (2003). All Rights Reserved. | Copyright (C) The Internet Society (2004). All Rights Reserved. | |||
| Abstract | Abstract | |||
| This document defines a new protocol element, the Internationalized | This document defines a new protocol element, the Internationalized | |||
| Resource Identifier (IRI), as a complement to the URI [RFCYYYY]. An | Resource Identifier (IRI), as a complement to the URI [RFCYYYY]. An | |||
| IRI is a sequence of characters from the Universal Character Set | IRI is a sequence of characters from the Universal Character Set | |||
| [ISO10646]. A mapping from IRIs to URIs is defined, which means that | [ISO10646]. A mapping from IRIs to URIs is defined, which means that | |||
| IRIs can be used instead of URIs where appropriate to identify | IRIs can be used instead of URIs where appropriate to identify | |||
| resources. | resources. | |||
| skipping to change at page 2, line 29 | skipping to change at page 2, line 29 | |||
| Table of Contents | Table of Contents | |||
| 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 4 | 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 4 | |||
| 1.1 Overview and Motivation . . . . . . . . . . . . . . . . . . 4 | 1.1 Overview and Motivation . . . . . . . . . . . . . . . . . . 4 | |||
| 1.2 Applicability . . . . . . . . . . . . . . . . . . . . . . . 4 | 1.2 Applicability . . . . . . . . . . . . . . . . . . . . . . . 4 | |||
| 1.3 Definitions . . . . . . . . . . . . . . . . . . . . . . . . 5 | 1.3 Definitions . . . . . . . . . . . . . . . . . . . . . . . . 5 | |||
| 1.4 Notation . . . . . . . . . . . . . . . . . . . . . . . . . . 6 | 1.4 Notation . . . . . . . . . . . . . . . . . . . . . . . . . . 6 | |||
| 2. IRI Syntax . . . . . . . . . . . . . . . . . . . . . . . . . 6 | 2. IRI Syntax . . . . . . . . . . . . . . . . . . . . . . . . . 6 | |||
| 2.1 Summary of IRI Syntax . . . . . . . . . . . . . . . . . . . 7 | 2.1 Summary of IRI Syntax . . . . . . . . . . . . . . . . . . . 7 | |||
| 2.2 ABNF for IRI References and IRIs . . . . . . . . . . . . . . 7 | 2.2 ABNF for IRI References and IRIs . . . . . . . . . . . . . . 7 | |||
| 3. Relationship between IRIs and URIs . . . . . . . . . . . . . 10 | 3. Relationship between IRIs and URIs . . . . . . . . . . . . . 9 | |||
| 3.1 Mapping of IRIs to URIs . . . . . . . . . . . . . . . . . . 10 | 3.1 Mapping of IRIs to URIs . . . . . . . . . . . . . . . . . . 10 | |||
| 3.2 Converting URIs to IRIs . . . . . . . . . . . . . . . . . . 13 | 3.2 Converting URIs to IRIs . . . . . . . . . . . . . . . . . . 13 | |||
| 3.2.1 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . 14 | 3.2.1 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . 14 | |||
| 4. Bidirectional IRIs for Right-to-left Languages . . . . . . . 16 | 4. Bidirectional IRIs for Right-to-left Languages . . . . . . . 15 | |||
| 4.1 Logical Storage and Visual Presentation . . . . . . . . . . 16 | 4.1 Logical Storage and Visual Presentation . . . . . . . . . . 16 | |||
| 4.2 Bidi IRI Structure . . . . . . . . . . . . . . . . . . . . . 17 | 4.2 Bidi IRI Structure . . . . . . . . . . . . . . . . . . . . . 17 | |||
| 4.3 Input of Bidi IRIs . . . . . . . . . . . . . . . . . . . . . 18 | 4.3 Input of Bidi IRIs . . . . . . . . . . . . . . . . . . . . . 18 | |||
| 4.4 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . 18 | 4.4 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . 18 | |||
| 5. IRI Equivalence and Comparison . . . . . . . . . . . . . . . 20 | 5. IRI Equivalence and Comparison . . . . . . . . . . . . . . . 20 | |||
| 5.1 Simple String Comparison . . . . . . . . . . . . . . . . . . 20 | 5.1 Simple String Comparison . . . . . . . . . . . . . . . . . . 20 | |||
| 5.2 Conversion to URIs . . . . . . . . . . . . . . . . . . . . . 21 | 5.2 Conversion to URIs . . . . . . . . . . . . . . . . . . . . . 21 | |||
| 5.3 Normalization . . . . . . . . . . . . . . . . . . . . . . . 21 | 5.3 Normalization . . . . . . . . . . . . . . . . . . . . . . . 21 | |||
| 5.4 Preferred Forms . . . . . . . . . . . . . . . . . . . . . . 22 | 5.4 Preferred Forms . . . . . . . . . . . . . . . . . . . . . . 22 | |||
| 6. Use of IRIs . . . . . . . . . . . . . . . . . . . . . . . . 22 | 6. Use of IRIs . . . . . . . . . . . . . . . . . . . . . . . . 23 | |||
| 6.1 Limitations on UCS Characters Allowed in IRIs . . . . . . . 23 | 6.1 Limitations on UCS Characters Allowed in IRIs . . . . . . . 23 | |||
| 6.2 Software Interfaces and Protocols . . . . . . . . . . . . . 23 | 6.2 Software Interfaces and Protocols . . . . . . . . . . . . . 23 | |||
| 6.3 Format of URIs and IRIs in Documents and Protocols . . . . . 23 | 6.3 Format of URIs and IRIs in Documents and Protocols . . . . . 23 | |||
| 6.4 Use of UTF-8 for Encoding Original Characters . . . . . . . 24 | 6.4 Use of UTF-8 for Encoding Original Characters . . . . . . . 24 | |||
| 6.5 Relative IRI References . . . . . . . . . . . . . . . . . . 25 | 6.5 Relative IRI References . . . . . . . . . . . . . . . . . . 25 | |||
| 7. URI/IRI Processing Guidelines (informative) . . . . . . . . 25 | 7. URI/IRI Processing Guidelines (informative) . . . . . . . . 25 | |||
| 7.1 URI/IRI Software Interfaces . . . . . . . . . . . . . . . . 25 | 7.1 URI/IRI Software Interfaces . . . . . . . . . . . . . . . . 25 | |||
| 7.2 URI/IRI Entry . . . . . . . . . . . . . . . . . . . . . . . 26 | 7.2 URI/IRI Entry . . . . . . . . . . . . . . . . . . . . . . . 26 | |||
| 7.3 URI/IRI Transfer Between Applications . . . . . . . . . . . 26 | 7.3 URI/IRI Transfer Between Applications . . . . . . . . . . . 26 | |||
| 7.4 URI/IRI Generation . . . . . . . . . . . . . . . . . . . . . 27 | 7.4 URI/IRI Generation . . . . . . . . . . . . . . . . . . . . . 27 | |||
| skipping to change at page 7, line 50 | skipping to change at page 7, line 50 | |||
| by their transformation to URI references and URIs, they can also be | by their transformation to URI references and URIs, they can also be | |||
| accepted and processed directly. Therefore, an ABNF definition for | accepted and processed directly. Therefore, an ABNF definition for | |||
| IRI references (which are the most general concept and the start of | IRI references (which are the most general concept and the start of | |||
| the grammar) and IRIs is given here. The syntax of this ABNF is | the grammar) and IRIs is given here. The syntax of this ABNF is | |||
| described in [RFC2234]. Character numbers are taken from the UCS, | described in [RFC2234]. Character numbers are taken from the UCS, | |||
| without implying any actual binary encoding. Terminals in the ABNF | without implying any actual binary encoding. Terminals in the ABNF | |||
| are characters, not bytes. | are characters, not bytes. | |||
| The following rules are different from [RFCYYYY]: | The following rules are different from [RFCYYYY]: | |||
| IRI = scheme ":" ["//" iauthority] ipath ["?" iquery] | ||||
| ["#" ifragment] | ||||
| IRI-reference = IRI / relative-IRI | IRI-reference = IRI / relative-IRI | |||
| IRI = scheme ":" ihier-part [ "?" iquery ] [ "#" ifragment ] | relative-IRI = ["//" iauthority] ipath ["?" iquery] | |||
| absolute-IRI = scheme ":" ihier-part [ "?" iquery ] | ["#" ifragment] | |||
| relative-IRI = ihier-part [ "?" iquery ] [ "#" ifragment ] | ||||
| ihier-part = inet-path / iabs-path / irel-path | ||||
| inet-path = "//" iauthority [ iabs-path ] | ||||
| iabs-path = "/" ipath-segments | ||||
| irel-path = ipath-segments | absolute-IRI = scheme ":" ["//" iauthority] ipath ["?" iquery] | |||
| iauthority = [ iuserinfo "@" ] ihost [ ":" port ] | iauthority = [ iuserinfo "@" ] ihost [ ":" port ] | |||
| iuserinfo = *( iunreserved / escaped / ";" / | iuserinfo = *( iunreserved / pct-encoded / sub-delims | |||
| ":" / "&" / "=" / "+" / "$" / "," ) | / ":" ) | |||
| ihost = [ IPv6reference / IPv4address / ihostname ] | ||||
| ihostname = idomainlabel iqualified | ||||
| iqualified = *( "." idomainlabel ) [ "." ] | ||||
| idomainlabel = <<See following production rules>> | ihost = IP-literal / IPv4address / ireg-name | |||
| ipath-segments = ipath-segment *( "/" ipath-segment ) | ireg-name = 0*255( iunreserved / pct-encoded / sub-delims ) | |||
| ipath-segment = *ipchar | ipath = isegment *( "/" isegment ) | |||
| ipchar = iunreserved / escaped / ";" / | isegment = *ipchar | |||
| ":" / "@" / "&" / "=" / "+" / "$" / "," | ||||
| iquery = *( ipchar / iprivate / "/" / "?" ) | iquery = *( ipchar / iprivate / "/" / "?" ) | |||
| ifragment = *( ipchar / "/" / "?" ) | ifragment = *( ipchar / "/" / "?" ) | |||
| iric = reserved / iunreserved / escaped | ipchar = iunreserved / pct-encoded / sub-delims / ":" | |||
| / "@" | ||||
| iunreserved = unreserved / ucschar | iunreserved = ALPHA / DIGIT / "-" / "." / "_" / "~" / ucschar | |||
| ucschar = %xA0-D7FF / %xF900-FDCF / %xFDF0-FFEF | ucschar = %xA0-D7FF / %xF900-FDCF / %xFDF0-FFEF | |||
| / %x10000-1FFFD / %x20000-2FFFD / %x30000-3FFFD | / %x10000-1FFFD / %x20000-2FFFD / %x30000-3FFFD | |||
| / %x40000-4FFFD / %x50000-5FFFD / %x60000-6FFFD | / %x40000-4FFFD / %x50000-5FFFD / %x60000-6FFFD | |||
| / %x70000-7FFFD / %x80000-8FFFD / %x90000-9FFFD | / %x70000-7FFFD / %x80000-8FFFD / %x90000-9FFFD | |||
| / %xA0000-AFFFD / %xB0000-BFFFD / %xC0000-CFFFD | / %xA0000-AFFFD / %xB0000-BFFFD / %xC0000-CFFFD | |||
| / %xD0000-DFFFD / %xE1000-EFFFD | / %xD0000-DFFFD / %xE1000-EFFFD | |||
| iprivate = %xE000-F8FF / %xF0000-FFFFD / %x100000-10FFFD | iprivate = %xE000-F8FF / %xF0000-FFFFD / %x100000-10FFFD | |||
| The 'idomainlabel' production rule is as follows: | ||||
| The value 'idomainlabel' is defined as a string of 'ucschar' obeying | ||||
| the following rules: | ||||
| a) Given a string of 'ucschar' values, the ToASCII operation | ||||
| [RFC3490] is performed on that string with the flag | ||||
| UseSTD3ASCIIRules set to TRUE and the flag AllowUnassigned set | ||||
| to FALSE for creating IRIs and set to TRUE otherwise. | ||||
| b) ToASCII is successful. (Note: This means that its output | Some productions are ambiguous. The "first-match-wins" (a.k.a. | |||
| conforms to 'domainlabel' as defined below.) | "greedy") algorithm applies. For details, see [RFCYYYY]. | |||
| The following are the same as [RFCYYYY]: | The following are the same as [RFCYYYY]: | |||
| scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." ) | scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." ) | |||
| port = *DIGIT | port = *DIGIT | |||
| domainlabel = alphanum [ 0*61( alphanum / "-" ) alphanum ] | IP-literal = "[" ( IPv6address / IPvFuture ) "]" | |||
| IPvFuture = "v" HEXDIG "." 1*( unreserved / sub-delims / ":" ) | ||||
| alphanum = ALPHA / DIGIT | ||||
| IPv4address = dec-octet "." dec-octet "." dec-octet "." dec-octet | ||||
| dec-octet = DIGIT ; 0-9 | ||||
| / ( %x31-39 DIGIT ) ; 10-99 | ||||
| / ( "1" 2DIGIT ) ; 100-199 | ||||
| / ( "2" %x30-34 DIGIT ) ; 200-249 | ||||
| / ( "25" %x30-35 ) ; 250-255 | ||||
| IPv6reference = "[" IPv6address "]" | ||||
| IPv6address = 6( h4 ":" ) ls32 | IPv6address = 6( h4 ":" ) ls32 | |||
| / "::" 5( h4 ":" ) ls32 | / "::" 5( h4 ":" ) ls32 | |||
| / [ h4 ] "::" 4( h4 ":" ) ls32 | / [ h4 ] "::" 4( h4 ":" ) ls32 | |||
| / [ *1( h4 ":" ) h4 ] "::" 3( h4 ":" ) ls32 | / [ *1( h4 ":" ) h4 ] "::" 3( h4 ":" ) ls32 | |||
| / [ *2( h4 ":" ) h4 ] "::" 2( h4 ":" ) ls32 | / [ *2( h4 ":" ) h4 ] "::" 2( h4 ":" ) ls32 | |||
| / [ *3( h4 ":" ) h4 ] "::" h4 ":" ls32 | / [ *3( h4 ":" ) h4 ] "::" h4 ":" ls32 | |||
| / [ *4( h4 ":" ) h4 ] "::" ls32 | / [ *4( h4 ":" ) h4 ] "::" ls32 | |||
| / [ *5( h4 ":" ) h4 ] "::" h4 | / [ *5( h4 ":" ) h4 ] "::" h4 | |||
| / [ *6( h4 ":" ) h4 ] "::" | / [ *6( h4 ":" ) h4 ] "::" | |||
| h4 = 1*4HEXDIG | h4 = 1*4HEXDIG | |||
| ls32 = ( h4 ":" h4 ) / IPv4address | ls32 = ( h4 ":" h4 ) / IPv4address | |||
| reserved = "/" / "?" / "#" / "[" / "]" / ";" / | IPv4address = dec-octet "." dec-octet "." dec-octet "." dec-octet | |||
| ":" / "@" / "&" / "=" / "+" / "$" / "," | ||||
| unreserved = ALPHA / DIGIT / mark | ||||
| mark = "-" / "_" / "." / "!" / "~" / "*" / "'" / | dec-octet = DIGIT ; 0-9 | |||
| "(" / ")" | / %x31-39 DIGIT ; 10-99 | |||
| / "1" 2DIGIT ; 100-199 | ||||
| / "2" %x30-34 DIGIT ; 200-249 | ||||
| / "25" %x30-35 ; 250-255 | ||||
| escaped = "%" HEXDIG HEXDIG | pct-encoded = "%" HEXDIG HEXDIG | |||
| unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~" | ||||
| reserved = gen-delims / sub-delims | ||||
| gen-delims = ":" / "/" / "?" / "#" / "[" / "]" / "@" | ||||
| sub-delims = "!" / "$" / "&" / "'" / "(" / ")" | ||||
| / "*" / "+" / "," / ";" / "=" | ||||
| 3. Relationship between IRIs and URIs | 3. Relationship between IRIs and URIs | |||
| IRIs are meant to replace URIs in identifying resources for | IRIs are meant to replace URIs in identifying resources for | |||
| protocols, formats and software components which use a UCS-based | protocols, formats and software components which use a UCS-based | |||
| character repertoire. These protocols and components may never need | character repertoire. These protocols and components may never need | |||
| to use URIs directly, especially when the resource identifier is used | to use URIs directly, especially when the resource identifier is used | |||
| simply for identification purposes. However, when the resource | simply for identification purposes. However, when the resource | |||
| identifier is used for resource retrieval, it is in many cases | identifier is used for resource retrieval, it is in many cases | |||
| necessary to determine the associated URI because most retrieval | necessary to determine the associated URI because most retrieval | |||
| skipping to change at page 11, line 22 | skipping to change at page 10, line 51 | |||
| Variant B) If the IRI is in some digital representation | Variant B) If the IRI is in some digital representation | |||
| (e.g. an octet stream) in some known non-Unicode | (e.g. an octet stream) in some known non-Unicode | |||
| encoding: Convert the IRI to a sequence of characters | encoding: Convert the IRI to a sequence of characters | |||
| from the UCS normalized according to NFC. | from the UCS normalized according to NFC. | |||
| Variant C) If the IRI is in a Unicode-based encoding (for | Variant C) If the IRI is in a Unicode-based encoding (for | |||
| example UTF-8 or UTF-16): Do not normalize. Move | example UTF-8 or UTF-16): Do not normalize. Move | |||
| directly to Step 2. | directly to Step 2. | |||
| Step 2) If the IRI contains an 'ihostname' part, replace this | Step 2) For each character that is disallowed in URI references, | |||
| 'ihostname' part by the part converted using the ToASCII | ||||
| operation specified in Section 4.1 of [RFC3490], with the flag | ||||
| UseSTD3ASCIIRules set to TRUE and the flag AllowUnassigned set | ||||
| to FALSE for creating IRIs and set to TRUE otherwise. The | ||||
| ToASCII operation may fail, but only if the IRI does not | ||||
| conform to the rules in Section 2.2. | ||||
| Step 3) For each character that is disallowed in URI references, | ||||
| apply steps 1) through 3) below. The disallowed characters | apply steps 1) through 3) below. The disallowed characters | |||
| consist of all non-ASCII characters allowed in IRIs. | consist of all non-ASCII characters allowed in IRIs. | |||
| 1) Convert the character to a sequence of one or more octets | 1) Convert the character to a sequence of one or more octets | |||
| using UTF-8 [RFCXXXX]. | using UTF-8 [RFC3629]. | |||
| 2) Convert each octet to %HH, where HH is the hexadecimal | 2) Convert each octet to %HH, where HH is the hexadecimal | |||
| notation of the octet value. Note: This is identical to | notation of the octet value. Note: This is identical to | |||
| the escaping mechanism in Section 2.4.1 of [RFCYYYY]. To | the escaping mechanism in Section 2.4.1 of [RFCYYYY]. To | |||
| reduce variability, the hexadecimal notation SHOULD use | reduce variability, the hexadecimal notation SHOULD use | |||
| upper case letters. | upper case letters. | |||
| 3) Replace the original character by the resulting character | 3) Replace the original character by the resulting character | |||
| sequence (i.e. a sequence of %HH triplets). | sequence (i.e. a sequence of %HH triplets). | |||
| The above mapping from IRIs to URIs produces URIs fully conforming to | The above mapping from IRIs to URIs produces URIs fully conforming to | |||
| [RFCYYYY]. The mapping is also an identity transformation for URIs | [RFCYYYY]. The mapping is also an identity transformation for URIs | |||
| and is idempotent -- applying the mapping a second time will not | and is idempotent -- applying the mapping a second time will not | |||
| change anything. Every URI is by definition an IRI. | change anything. Every URI is by definition an IRI. | |||
| Infrastructure accepting IRIs MAY also deal with 'ihostname' parts | Infrastructure accepting IRIs MAY also convert the ireg-name | |||
| escaped according to Step 3) rather than Step 2). For example, Step | component of an IRI as follows (before step 2 above) if it knows that | |||
| 2) converts the IRI | the scheme in question uses domain names: Replace the ireg-name part | |||
| http://résumé.example.org to | of the IRI by the part converted using the ToASCII operation | |||
| http://xn--rsum-bpad.example.org. For backward compatibility, | specified in Section 4.1 of [RFC3490], with the flag | |||
| http://r%C3%A9sum%C3%A9.example.org would also be converted to | UseSTD3ASCIIRules set to TRUE and the flag AllowUnassigned set to | |||
| http://xn--rsum-bpad.example.org. | FALSE for creating IRIs and set to TRUE otherwise. The ToASCII | |||
| operation may fail, but this would mean that the IRI cannot be | ||||
| resolved. For example, the IRI | ||||
| http://résumé.example.org may be converted to | ||||
| http://xn--rsum-bpad.example.org instead of | ||||
| http://r%C3%A9sum%C3%A9.example.org. | ||||
| Infrastructure accepting IRIs MAY also deal with the printable | Note: The uniform treatment of the whole IRI in step 2) above is | |||
| characters in US-ASCII that are not allowed in URIs, namely "<", ">", | important to not make processing dependent on URI scheme. See | |||
| '"', Space, "{", "}", "\|", "\", "^", and "`", in step 3) above. If | [Gettys] for an in-depth discussion. | |||
| such characters are found but are not converted, then the conversion | ||||
| SHOULD fail. Please note that the number sign ("#"), the percent | ||||
| sign ("%"), and the square bracket characters ("[", "]") are not part | ||||
| of the above list, and MUST NOT be converted. Protocols and formats | ||||
| that have used earlier definitions of IRIs including these characters | ||||
| MAY require unescaping of these characters as a preprocessing step to | ||||
| extract the actual IRI from a given field. Such preprocessing MAY | ||||
| also be used by applications allowing the user to enter an IRI. | ||||
| Internationalized Domain Names may be contained in parts of an | Note: In practice, the difference above will not be noticed if | |||
| IRI other than the 'ihostname' part. In this case, Step 2) is | mapping from IRI to URI and resolution is tightly integrated | |||
| not used, but Step 3) is applied. This is important to | (e.g. carried out in the same user agent). But conversion | |||
| maintain uniform treatment of URIs. See [Gettys] for an in- | using [RFC3490] may be able to better deal with backwards | |||
| depth discussion. It is the responsibility of scheme-specific | compatibility issues in case mapping and resolution are | |||
| separated, as in the case of using an HTTP proxy. | ||||
| Note: Internationalized Domain Names may be contained in parts of | ||||
| an IRI other than the ireg-name part. It is the responsibility | ||||
| of scheme-specific implementations (if the Internationalized | ||||
| Domain Name is part of the scheme syntax) or of server-side | ||||
| implementations (if the Internationalized Domain Name is part | implementations (if the Internationalized Domain Name is part | |||
| of the scheme syntax) or of server-side implementations (if the | of 'iquery') to apply the necessary conversions at the | |||
| Internationalized Domain Name is part of 'iquery') to apply the | appropriate point. Example: Trying to validate the Web page at | |||
| necessary conversions at the appropriate point. Example: | ||||
| Trying to validate the Web page at | ||||
| http://résumé.example.org would lead to an IRI of | http://résumé.example.org would lead to an IRI of | |||
| http://validator.w3.org/ | http://validator.w3.org/ | |||
| check?uri=http%3A%2F%2Frésumé.example.org, which | check?uri=http%3A%2F%2Frésumé.example.org, which | |||
| would convert to a URI of | would convert to a URI of | |||
| http://validator.w3.org/ | http://validator.w3.org/ | |||
| check?uri=http%3A%2F%2Fr%C3%A9sum%C3%A9.example.org. The | check?uri=http%3A%2F%2Fr%C3%A9sum%C3%A9.example.org. The | |||
| server side implementation would be responsible to do the | server side implementation would be responsible to do the | |||
| necessary conversions in order to be able to retrieve the Web | necessary conversions in order to be able to retrieve the Web | |||
| page. | page. | |||
| In this process (in step 3.3), characters allowed in URI | Infrastructure accepting IRIs MAY also deal with the printable | |||
| characters in US-ASCII that are not allowed in URIs, namely "<", ">", | ||||
| '"', Space, "{", "}", "\|", "\", "^", and "`", in step 2) above. If | ||||
| such characters are found but are not converted, then the conversion | ||||
| SHOULD fail. Please note that the number sign ("#"), the percent | ||||
| sign ("%"), and the square bracket characters ("[", "]") are not part | ||||
| of the above list, and MUST NOT be converted. Protocols and formats | ||||
| that have used earlier definitions of IRIs including these characters | ||||
| MAY require unescaping of these characters as a preprocessing step to | ||||
| extract the actual IRI from a given field. Such preprocessing MAY | ||||
| also be used by applications allowing the user to enter an IRI. | ||||
| Note: In this process (in step 2.3), characters allowed in URI | ||||
| references as well as existing escape sequences are not escaped | references as well as existing escape sequences are not escaped | |||
| further. (This mapping is similar to, but different from, the | further. (This mapping is similar to, but different from, the | |||
| escaping applied when including arbitrary content into some | escaping applied when including arbitrary content into some | |||
| part of a URI.) For example, an IRI of | part of a URI.) For example, an IRI of | |||
| http://www.example.org/red%09rosé#red (in XML notation) is | http://www.example.org/red%09rosé#red (in XML notation) is | |||
| converted to | converted to | |||
| http://www.example.org/red%09ros%C3%A9#red, not to something | http://www.example.org/red%09ros%C3%A9#red, not to something | |||
| like | like | |||
| http%3A%2F%2Fwww.example.org%2Fred%2509ros%C3%A9%23red. | http%3A%2F%2Fwww.example.org%2Fred%2509ros%C3%A9%23red. | |||
| Some older software transcoding to UTF-8 may produce illegal | Note: Some older software transcoding to UTF-8 may produce illegal | |||
| output for some input, in particular for characters outside the | output for some input, in particular for characters outside the | |||
| BMP (Basic Multilingual Plane). As an example, for the | BMP (Basic Multilingual Plane). As an example, for the | |||
| following IRI with non-BMP characters (in XML Notation): | following IRI with non-BMP characters (in XML Notation): | |||
| http://example.com/𐌀𐌁𐌂 | http://example.com/𐌀𐌁𐌂 | |||
| (the first three letters of the Old Italic alphabet) the | (the first three letters of the Old Italic alphabet) the | |||
| correct conversion to a URI is: | correct conversion to a URI is: | |||
| http://example.com/%F0%90%8C%80%F0%90%8C%81%F0%90%8C%82 | http://example.com/%F0%90%8C%80%F0%90%8C%81%F0%90%8C%82 | |||
| 3.2 Converting URIs to IRIs | 3.2 Converting URIs to IRIs | |||
| skipping to change at page 13, line 48 | skipping to change at page 13, line 38 | |||
| discussion, see [Duerst97].) | discussion, see [Duerst97].) | |||
| c) The conversion may result in a character that is not | c) The conversion may result in a character that is not | |||
| appropriate in an IRI. See Section 6.1 for further details. | appropriate in an IRI. See Section 6.1 for further details. | |||
| Conversion from a URI to an IRI is done using the following steps (or | Conversion from a URI to an IRI is done using the following steps (or | |||
| any other algorithm that produces the same result): | any other algorithm that produces the same result): | |||
| 1) Represent the URI as a sequence of octets in US-ASCII. | 1) Represent the URI as a sequence of octets in US-ASCII. | |||
| 2) Apply the ToUnicode operation to each 'domainlabel' in the | 2) Convert all hexadecimal escapes (% followed by two hexadecimal | |||
| 'hostname' part (if there is one), representing the output as | ||||
| UTF-8. | ||||
| 3) Convert all hexadecimal escapes (% followed by two hexadecimal | ||||
| digits) except those corresponding to '%', characters in | digits) except those corresponding to '%', characters in | |||
| 'reserved', and characters in US-ASCII not allowed in URIs, to | 'reserved', and characters in US-ASCII not allowed in URIs, to | |||
| the corresponding octets. | the corresponding octets. | |||
| 4) Re-escape any octet produced in step 3) that is not part of a | 3) Re-escape any octet produced in step 2) that is not part of a | |||
| strictly legal UTF-8 octet sequence. | strictly legal UTF-8 octet sequence. | |||
| 5) Re-escape all octets produced in step 3) that in UTF-8 | 4) Re-escape all octets produced in step 3) that in UTF-8 | |||
| represent characters that are not appropriate according to | represent characters that are not appropriate according to | |||
| Section 4.1 and Section 6.1. | Section 4.1 and Section 6.1. | |||
| 6) Interpret the resulting octet sequence as a sequence of | 5) Interpret the resulting octet sequence as a sequence of | |||
| characters encoded in UTF-8. | characters encoded in UTF-8. | |||
| This procedure will convert as many escaped non-ASCII characters as | This procedure will convert as many escaped non-ASCII characters as | |||
| possible to characters in an IRI. Because there are some choices | possible to characters in an IRI. Because there are some choices | |||
| when applying step 5) (see Section 6.1), results may vary. | when applying step 4) (see Section 6.1), results may vary. | |||
| Conversions from URIs to IRIs MUST NOT use any other encoding than | Conversions from URIs to IRIs MUST NOT use any other encoding than | |||
| UTF-8 in steps 2), 4) and 5) above, even if it might be possible from | UTF-8 in steps 3) and 4) above, even if it might be possible from | |||
| context to guess that another encoding than UTF-8 was used in the | context to guess that another encoding than UTF-8 was used in the | |||
| URI. As an example, the URI http://www.example.org/r%E9sum%E9.html | URI. As an example, the URI http://www.example.org/r%E9sum%E9.html | |||
| might with some guessing be interpreted to contain two e-acute | might with some guessing be interpreted to contain two e-acute | |||
| characters encoded as iso-8859-1. It must not be converted to an IRI | characters encoded as iso-8859-1. It must not be converted to an IRI | |||
| containing these e-acute characters. Otherwise, the IRI will in the | containing these e-acute characters. Otherwise, the IRI will in the | |||
| future be mapped to http://www.example.org/r%C3%A9sum%C3%A9.html, | future be mapped to http://www.example.org/r%C3%A9sum%C3%A9.html, | |||
| which is a different URI from http://www.example.org/r%E9sum%E9.html. | which is a different URI than http://www.example.org/r%E9sum%E9.html. | |||
| 3.2.1 Examples | 3.2.1 Examples | |||
| This section shows various examples of converting URIs to IRIs. The | This section shows various examples of converting URIs to IRIs. The | |||
| notation <hh> is used to denote octets outside those that can be | notation <hh> is used to denote octets outside those that can be | |||
| represented in this document. Each example shows the result after | represented in this document. Each example shows the result after | |||
| applying each of the steps 1) to 6). XML Notation is used for the | applying each of the steps 1) to 5). XML Notation is used for the | |||
| final result. | final result. | |||
| The following example contains the sequence '%C3%BC', which is a | The following example contains the sequence '%C3%BC', which is a | |||
| strictly legal UTF-8 sequence, and which is converted into the actual | strictly legal UTF-8 sequence, and which is converted into the actual | |||
| character U+00FC LATIN SMALL LETTER U WITH DIAERESIS (also known as | character U+00FC LATIN SMALL LETTER U WITH DIAERESIS (also known as | |||
| u-umlaut). | u-umlaut). | |||
| 1) http://www.example.org/D%C3%BCrst | 1) http://www.example.org/D%C3%BCrst | |||
| 2) http://www.example.org/D%C3%BCrst | 2) http://www.example.org/D<c3><bc>rst | |||
| 3) http://www.example.org/D<c3><bc>rst | 3) http://www.example.org/D<c3><bc>rst | |||
| 4) http://www.example.org/D<c3><bc>rst | ||||
| 5) http://www.example.org/D<c3><bc>rst | 4) http://www.example.org/D<c3><bc>rst | |||
| 6) http://www.example.org/Dürst | 5) http://www.example.org/Dürst | |||
| The following example contains the sequence '%FC', which might | The following example contains the sequence '%FC', which might | |||
| represent U+00FC LATIN SMALL LETTER U WITH DIAERESIS in the | represent U+00FC LATIN SMALL LETTER U WITH DIAERESIS in the | |||
| iso-8859-1 encoding. (It might represent other characters in other | iso-8859-1 encoding. (It might represent other characters in other | |||
| encodings. For example, the octet <fc> in iso-8859-5 represents | encodings. For example, the octet <fc> in iso-8859-5 represents | |||
| U+045C CYRILLIC SMALL LETTER KJE.) Because <fc> is not part of a | U+045C CYRILLIC SMALL LETTER KJE.) Because <fc> is not part of a | |||
| strictly legal UTF-8 sequence, it is re-escaped in step 2). | strictly legal UTF-8 sequence, it is re-escaped in step 3). | |||
| 1) http://www.example.org/D%FCrst | 1) http://www.example.org/D%FCrst | |||
| 2) http://www.example.org/D%FCrst | 2) http://www.example.org/D<fc>rst | |||
| 3) http://www.example.org/D%FCrst | ||||
| 3) http://www.example.org/D<fc>rst | ||||
| 4) http://www.example.org/D%FCrst | 4) http://www.example.org/D%FCrst | |||
| 5) http://www.example.org/D%FCrst | 5) http://www.example.org/D%FCrst | |||
| 6) http://www.example.org/D%FCrst | ||||
| The following example contains '%e2%80%ae', which is the escaped | The following example contains '%e2%80%ae', which is the escaped | |||
| UTF-8 encoding of U+202E, RIGHT-TO-LEFT OVERRIDE. Section 4.1 | UTF-8 encoding of U+202E, RIGHT-TO-LEFT OVERRIDE. Section 4.1 | |||
| forbids the direct use of this character in an IRI. Therefore, the | forbids the direct use of this character in an IRI. Therefore, the | |||
| corresponding octets are re-escaped in step 5). This example shows | corresponding octets are re-escaped in step 4). This example shows | |||
| that the case (upper or lower) of letters used in escapes may not be | that the case (upper or lower) of letters used in escapes may not be | |||
| preserved. The example also contains a punycode-encoded domain name | preserved. The example also contains a punycode-encoded domain name | |||
| label (xn--99zt52a), which is converted to the corresponding | label (xn--99zt52a), which is not converted. | |||
| characters U+7D0D U+8C46 (Japanese Natto). | ||||
| 1) http://xn--99zt52a.example.org/%e2%80%ae | 1) http://xn--99zt52a.example.org/%e2%80%ae | |||
| 2) http://<e7><b4><8d><e8><b1><86>.example.org/%e2%80%ae | 2) http://xn--99zt52a.example.org/<e2><80><ae> | |||
| 3) http://<e7><b4><8d><e8><b1><86>.example.org/<e2><80><ae> | 3) http://xn--99zt52a.example.org/<e2><80><ae> | |||
| 4) http://<e7><b4><8d><e8><b1><86>.example.org/<e2><80><ae> | 4) http://xn--99zt52a.example.org/%E2%80%AE | |||
| 5) http://<e7><b4><8d><e8><b1><86>.example.org/%E2%80%AE | 5) http://xn--99zt52a.example.org/%E2%80%AE | |||
| 6) http://納豆.example.org/%E2%80%AE | Implementations with scheme-specific knowledge MAY convert punycode- | |||
| encoded domain name labels to the corresponding characters using the | ||||
| ToUnicode procedure. Thus, for the example above, the label xn-- | ||||
| 99zt52a may be converted to U+7D0D U+8C46 (Japanese Natto), leading | ||||
| to the overall IRI of | ||||
| http://納豆.example.org/%E2%80%AE | ||||
| 4. Bidirectional IRIs for Right-to-left Languages | 4. Bidirectional IRIs for Right-to-left Languages | |||
| Some UCS characters, such as those used in the Arabic and Hebrew | Some UCS characters, such as those used in the Arabic and Hebrew | |||
| script, have an inherent right-to-left (rtl) writing direction. IRIs | script, have an inherent right-to-left (rtl) writing direction. IRIs | |||
| containing such characters (called bidirectional IRIs or Bidi IRIs) | containing such characters (called bidirectional IRIs or Bidi IRIs) | |||
| require additional attention because of the non-trivial relation | require additional attention because of the non-trivial relation | |||
| between logical representation (used for digital representation as | between logical representation (used for digital representation as | |||
| well as when reading/spelling) and visual representation (used for | well as when reading/spelling) and visual representation (used for | |||
| display/printing). | display/printing). | |||
| skipping to change at page 16, line 38 | skipping to change at page 16, line 19 | |||
| 4.1 Logical Storage and Visual Presentation | 4.1 Logical Storage and Visual Presentation | |||
| When stored or transmitted in digital representation, bidirectional | When stored or transmitted in digital representation, bidirectional | |||
| IRIs MUST be in full logical order, and MUST conform to the IRI | IRIs MUST be in full logical order, and MUST conform to the IRI | |||
| syntax rules (which includes the rules relevant to their scheme). | syntax rules (which includes the rules relevant to their scheme). | |||
| This assures that bidirectional IRIs can be processed in the same way | This assures that bidirectional IRIs can be processed in the same way | |||
| as other IRIs. | as other IRIs. | |||
| When rendered, bidirectional IRIs MUST be rendered using the Unicode | When rendered, bidirectional IRIs MUST be rendered using the Unicode | |||
| Bidirectional Algorithm [UNIV4], [UNI9]. Bidirectional IRIs MUST be | Bidirectional Algorithm [UNIV4], [UNI9]. Bidirectional IRIs MUST be | |||
| rendered with an overall left-to-right (ltr) direction. | rendered in the same way as they would be rendered if they were in a | |||
| left-to-right embedding, i.e. as if they were preceded by U+202A, | ||||
| LEFT-TO-RIGHT EMBEDDING (LRE), and followed by U+202C, POP | ||||
| DIRECTIONAL FORMATTING (PDF). Setting the embedding direction can | ||||
| also be done in a higher-order protocol (e.g. the dir='ltr' | ||||
| attribute in HTML). | ||||
| In text with a left-to-right base directionality or embedding (such | There is no requirement to actually use the above embedding if the | |||
| as used for English or Cyrillic), the Unicode Bidirectional Algorithm | display is still the same without the embedding. For example, a | |||
| will automatically use an overall ltr direction for the IRI. In text | bidirectional IRI in a text with left-to-right base directionality | |||
| with a rtl base directionality or embedding (such as used for Arabic | (such as used for English or Cyrillic) that is preceded and followed | |||
| or Hebrew), setting a different embedding direction for the IRI is | by whitespace and strong left-to-right characters does not need an | |||
| needed. Setting the embedding direction can be done in a higher- | embedding. Also, a bidirectional relative IRI that only contains | |||
| order protocol (e.g. the dir='ltr' attribute in HTML). If this is | strong right-to-left characters and weak characters and that starts | |||
| not available (e.g. in plain text), setting the embedding is done | and ends with a strong right-to-left character and appears in a text | |||
| with Unicode bidi formatting codes, i.e. U+202A, LEFT-TO-RIGHT | with right-to-left base directionality (such as used for Arabic or | |||
| EMBEDDING (LRE) before the IRI, and U+202C, POP DIRECTIONAL | Hebrew) and is preceded and followed by whitespace and strong | |||
| FORMATTING (PDF) after the IRI, both not being part of the IRI | characters does not need an embedding. | |||
| itself. | ||||
| IRIs MUST NOT contain bidirectional formatting characters (LRM, RLM, | In some other cases, using U+200E, LEFT-TO-RIGHT MARK (LRM) may be | |||
| LRE, RLE, LRO, RLO, and PDF). They affect the visual rendering of | sufficient to force the correct display behavior. However, the | |||
| the IRI, but do not themselves appear visually. It would therefore | details of the Unicode Bidirectional algorithm are not always easy to | |||
| not be possible to correctly input an IRI with such characters. | understand. Implementers are strongly advised to err on the side of | |||
| caution and to use embedding in all cases where they are not | ||||
| completely sure that the display behavior is unaffected without the | ||||
| embedding. | ||||
| The Unicode Bidirectional Algorithm ([UNI9], Section 4.3) permits | ||||
| higher-level protocols to influence bidirectional rendering. Such | ||||
| changes by higher-level protocols MUST NOT be used if they change the | ||||
| rendering of IRIs. | ||||
| The bidirectional formatting characters that may be used before or | ||||
| after the IRI to assure correct display are themselves not part of | ||||
| the IRI. IRIs MUST NOT contain bidirectional formatting characters | ||||
| (LRM, RLM, LRE, RLE, LRO, RLO, and PDF). They affect the visual | ||||
| rendering of the IRI, but do not themselves appear visually. It | ||||
| would therefore not be possible to correctly input an IRI with such | ||||
| characters. | ||||
| 4.2 Bidi IRI Structure | 4.2 Bidi IRI Structure | |||
| The Unicode Bidirectional Algorithm is designed mainly for running | The Unicode Bidirectional Algorithm is designed mainly for running | |||
| text. To make sure that it does not affect the rendering of | text. To make sure that it does not affect the rendering of | |||
| bidirectional IRIs too much, some restrictions on bidirectional IRIs | bidirectional IRIs too much, some restrictions on bidirectional IRIs | |||
| are necessary. These restrictions are given in terms of delimiters | are necessary. These restrictions are given in terms of delimiters | |||
| (structural characters, mostly punctuation such as '@', '.', ':', | (structural characters, mostly punctuation such as '@', '.', ':', | |||
| '/') and components (usually consisting mostly of letters and | '/') and components (usually consisting mostly of letters and | |||
| digits). | digits). | |||
| The following syntax rules from Section 2.2 correspond to components | The following syntax rules from Section 2.2 correspond to components | |||
| for the purpose of Bidi behavior: iuserinfo, ipath-segment, | for the purpose of Bidi behavior: iuserinfo, isegment, ireg-name, | |||
| ihostname, iquery, and ifragment. | iquery, and ifragment. | |||
| Specifications that define the syntax of any of the above components | Specifications that define the syntax of any of the above components | |||
| MAY divide them further and define smaller parts to be components | MAY divide them further and define smaller parts to be components | |||
| according to this document. As an example, the restrictions of | according to this document. As an example, the restrictions of | |||
| [RFC3490] on bidirectional domain names correspond to treating each | [RFC3490] on bidirectional domain names correspond to treating each | |||
| label of the domain name as a component. Even where the components | label of a domain name as a component for those schemes where ireg- | |||
| are not defined formally, it may be helpful to think about some | name is a domain name. Even where the components are not defined | |||
| syntax in terms of components and to apply the relevant restrictions. | formally, it may be helpful to think about some syntax in terms of | |||
| For example, for the usual name/value syntax in query parts, it is | components and to apply the relevant restrictions. For example, for | |||
| convenient to treat each name and each value as a component. As | the usual name/value syntax in query parts, it is convenient to treat | |||
| another example, the extensions in a resource name can be treated as | each name and each value as a component. As another example, the | |||
| separate components. | extensions in a resource name can be treated as separate components. | |||
| For each component, the following restrictions apply: | For each component, the following restrictions apply: | |||
| 1) A component SHOULD NOT use both right-to-left and | 1) A component SHOULD NOT use both right-to-left and | |||
| left-to-right characters. | left-to-right characters. | |||
| 2) A component using right-to-left characters SHOULD start and end | 2) A component using right-to-left characters SHOULD start and end | |||
| with right-to-left characters. | with right-to-left characters. | |||
| The above restrictions are given as shoulds, rather than as musts. | The above restrictions are given as shoulds, rather than as musts. | |||
| For IRIs that are never presented visually, they are not relevant. | For IRIs that are never presented visually, they are not relevant. | |||
| However, for IRIs in general, they are very important to ensure | However, for IRIs in general, they are very important to ensure | |||
| consistent conversion between visual presentation and logical | consistent conversion between visual presentation and logical | |||
| representation, in both directions. | representation, in both directions. | |||
| In some components, the above restrictions may actually be | Note: In some components, the above restrictions may actually be | |||
| strictly enforced. For example, [RFC3490] requires that these | strictly enforced. For example, [RFC3490] requires that these | |||
| restrictions apply to the labels of the host name part of an | restrictions apply to the labels of a host name for those | |||
| IRI. In some other components, for example path components, | schemes where ireg-name is a host name. In some other | |||
| following these restrictions may not be too difficult. For | components, for example path components, following these | |||
| other components, such as parts of the query part, it may be | restrictions may not be too difficult. For other components, | |||
| very difficult to enforce the restrictions, because the values | such as parts of the query part, it may be very difficult to | |||
| of query parameters may be arbitrary character sequences. | enforce the restrictions, because the values of query | |||
| parameters may be arbitrary character sequences. | ||||
| If the above restrictions cannot be satisfied otherwise, the affected | If the above restrictions cannot be satisfied otherwise, the affected | |||
| component can always be mapped to URI notation as described in | component can always be mapped to URI notation as described in | |||
| Section 3.1. Please note that the whole component needs to be mapped | Section 3.1. Please note that the whole component needs to be mapped | |||
| (see also Example 9 below). | (see also Example 9 below). | |||
| 4.3 Input of Bidi IRIs | 4.3 Input of Bidi IRIs | |||
| Bidi input methods MUST generate Bidi IRIs in logical order while | Bidi input methods MUST generate Bidi IRIs in logical order while | |||
| rendering them according to Section 4.1. During input, rendering | rendering them according to Section 4.1. During input, rendering | |||
| skipping to change at page 22, line 24 | skipping to change at page 22, line 28 | |||
| provided instead of a direct negative result). The best recipe | provided instead of a direct negative result). The best recipe | |||
| is that the generator uses a reasonable capitalization, and | is that the generator uses a reasonable capitalization, and | |||
| when transferring the URI, that capitalization is never changed. | when transferring the URI, that capitalization is never changed. | |||
| Various IRI schemes may allow the usage of International Domain Names | Various IRI schemes may allow the usage of International Domain Names | |||
| (IDN) [RFC3490]. When in use in IRIs, those names SHOULD be | (IDN) [RFC3490]. When in use in IRIs, those names SHOULD be | |||
| validated using the ToASCII operation defined in [RFC3490], with the | validated using the ToASCII operation defined in [RFC3490], with the | |||
| flags "UseSTD3ASCIIRules" and "AllowUnassigned". An IRI containing | flags "UseSTD3ASCIIRules" and "AllowUnassigned". An IRI containing | |||
| an invalid IDN cannot successfully be resolved. For legibility | an invalid IDN cannot successfully be resolved. For legibility | |||
| purposes, IDN components of IRIs SHOULD NOT be converted into ASCII | purposes, IDN components of IRIs SHOULD NOT be converted into ASCII | |||
| Compatible Encoding (ACE). However, this conversion is applied when | Compatible Encoding (ACE). | |||
| mapping an IRI into a URI, see Section 3.1. | ||||
| 5.4 Preferred Forms | 5.4 Preferred Forms | |||
| The following are the preferred forms for IRIs when generated: | The following are the preferred forms for IRIs when generated: | |||
| - Always provide the URI scheme in lowercase characters. | - Always provide the URI scheme in lowercase characters. | |||
| - Only perform percent-escaping where it is essential. | - Only perform percent-escaping where it is essential. | |||
| - Always use uppercase A-through-F characters when percent- | - Always use uppercase A-through-F characters when percent- | |||
| skipping to change at page 22, line 47 | skipping to change at page 22, line 50 | |||
| - Always provide the hostname, if any, in the form produced when | - Always provide the hostname, if any, in the form produced when | |||
| applying nameprep [RFC3491]. This in particular includes using | applying nameprep [RFC3491]. This in particular includes using | |||
| lowercase characters rather than uppercase characters where | lowercase characters rather than uppercase characters where | |||
| applicable. | applicable. | |||
| - Where possible, provide IRI components in NFKC or NFC. | - Where possible, provide IRI components in NFKC or NFC. | |||
| - Prevent /./ and /../ from appearing in non-relative URI paths. | - Prevent /./ and /../ from appearing in non-relative URI paths. | |||
| - For schemes that define an empty path to be equivalent to a | ||||
| path of "/", use "/". | ||||
| 6. Use of IRIs | 6. Use of IRIs | |||
| 6.1 Limitations on UCS Characters Allowed in IRIs | 6.1 Limitations on UCS Characters Allowed in IRIs | |||
| This section discusses limitations on characters and character | This section discusses limitations on characters and character | |||
| sequences usable for IRIs. The considerations in this section are | sequences usable for IRIs. The considerations in this section are | |||
| relevant when creating IRIs and when converting from URIs to IRIs. | relevant when creating IRIs and when converting from URIs to IRIs. | |||
| a) The repertoire of characters allowed in each IRI component is | a) The repertoire of characters allowed in each IRI component is | |||
| limited by the definition of that component. For example, the | limited by the definition of that component. For example, the | |||
| skipping to change at page 24, line 45 | skipping to change at page 24, line 47 | |||
| For example, for a document with a URI of | For example, for a document with a URI of | |||
| http://www.example.org/r%C3%A9sum%C3%A9.html, it is possible to | http://www.example.org/r%C3%A9sum%C3%A9.html, it is possible to | |||
| construct a corresponding IRI (in XML notation, see Section 1.4): | construct a corresponding IRI (in XML notation, see Section 1.4): | |||
| http://www.example.org/résumé.html (é stands for the | http://www.example.org/résumé.html (é stands for the | |||
| e-acute character, and %C3%A9 is the UTF-8 encoded and escaped | e-acute character, and %C3%A9 is the UTF-8 encoded and escaped | |||
| representation of that character). On the other hand, for a document | representation of that character). On the other hand, for a document | |||
| with a URI of http://www.example.org/r%E9sum%E9.html, the escaped | with a URI of http://www.example.org/r%E9sum%E9.html, the escaped | |||
| octets cannot be converted to actual characters in an IRI, because | octets cannot be converted to actual characters in an IRI, because | |||
| the escaping is not based on UTF-8. | the escaping is not based on UTF-8. | |||
| The requirement for the use of UTF-8 applies to all parts of a URI, | The requirement for the use of UTF-8 applies to all parts of a URI. | |||
| with the exception of the ihostname part. However, it is possible | However, it is possible that the capability of IRIs to represent a | |||
| that the capability of IRIs to represent a wide range of characters | wide range of characters directly is used just in some parts of the | |||
| directly is used just in some parts of the IRI (or IRI reference). | IRI (or IRI reference). The other parts of the IRI may only contain | |||
| The other parts of the IRI may only contain ASCII characters, or they | ASCII characters, or they may not be based on UTF-8. They may be | |||
| may not be based on UTF-8. They may be based on another encoding, or | based on another encoding, or they may directly encode raw binary | |||
| they may directly encode raw binary data (see also [RFC2397]). | data (see also [RFC2397]). | |||
| For example, it is possible to have a URI reference of | For example, it is possible to have a URI reference of | |||
| http://www.example.org/r%E9sum%E9.xml#r%C3%A9sum%C3%A9, where the | http://www.example.org/r%E9sum%E9.xml#r%C3%A9sum%C3%A9, where the | |||
| document name is encoded in iso-8859-1 based on server settings, but | document name is encoded in iso-8859-1 based on server settings, but | |||
| the fragment identifier is encoded in UTF-8 according to [XPointer]. | the fragment identifier is encoded in UTF-8 according to [XPointer]. | |||
| The IRI corresponding to the above URI would be (in XML notation) | The IRI corresponding to the above URI would be (in XML notation) | |||
| http://www.example.org/r%E9sum%E9.xml#résumé. | http://www.example.org/r%E9sum%E9.xml#résumé. | |||
| Similar considerations apply to query parts. The functionality of | Similar considerations apply to query parts. The functionality of | |||
| IRIs (namely to be able to include non-ASCII characters) can only be | IRIs (namely to be able to include non-ASCII characters) can only be | |||
| skipping to change at page 32, line 27 | skipping to change at page 32, line 27 | |||
| [RFC3490] Faltstrom, P., Hoffman, P. and A. Costello, | [RFC3490] Faltstrom, P., Hoffman, P. and A. Costello, | |||
| "Internationalizing Domain Names in Applications (IDNA)", | "Internationalizing Domain Names in Applications (IDNA)", | |||
| RFC 3490, March 2003, <http://www.ietf.org/rfc/ | RFC 3490, March 2003, <http://www.ietf.org/rfc/ | |||
| rfc3490.txt>. | rfc3490.txt>. | |||
| [RFC3491] Hoffman, P. and M. Blanchet, "Nameprep: A Stringprep | [RFC3491] Hoffman, P. and M. Blanchet, "Nameprep: A Stringprep | |||
| Profile for Internationalized Domain Names (IDN)", RFC | Profile for Internationalized Domain Names (IDN)", RFC | |||
| 3491, March 2003. | 3491, March 2003. | |||
| [RFCXXXX] Yergeau, F., "UTF-8, a transformation format of ISO | [RFC3629] Yergeau, F., "UTF-8, a transformation format of ISO | |||
| 10646", draft-yergeau-rfc2279bis-05.txt (work in | 10646", STD 63, RFC 3629, November 2003, <http:// | |||
| progress), June 2003, <http://www.ietf.org/internet- | www.ietf.org/rfc/rfc3629.txt>. | |||
| drafts/draft-yergeau-rfc2279bis-05.txt>. | ||||
| [RFCYYYY] Berners-Lee, T., Fielding, R. and L. Masinter, "Uniform | [RFCYYYY] Berners-Lee, T., Fielding, R. and L. Masinter, "Uniform | |||
| Resource Identifier (URI): Generic Syntax", draft- | Resource Identifier (URI): Generic Syntax", draft- | |||
| fielding-uri-rfc2396bis-03.txt (work in progress), June | fielding-uri-rfc2396bis-03.txt (work in progress), June | |||
| 2003. | 2003. | |||
| [UTR15] Davis, M. and M. Duerst, "Unicode Normalization Forms", | [UTR15] Davis, M. and M. Duerst, "Unicode Normalization Forms", | |||
| Unicode Standard Annex #15, March 2001, <http:// | Unicode Standard Annex #15, March 2001, <http:// | |||
| www.unicode.org/unicode/reports/tr15/tr15-21.html>. | www.unicode.org/unicode/reports/tr15/tr15-21.html>. | |||
| skipping to change at page 36, line 7 | skipping to change at page 36, line 7 | |||
| One Microsoft Way | One Microsoft Way | |||
| Redmond, WA 98052 | Redmond, WA 98052 | |||
| U.S.A. | U.S.A. | |||
| Phone: +1 425 882-8080 | Phone: +1 425 882-8080 | |||
| EMail: mailto:michelsu@microsoft.com | EMail: mailto:michelsu@microsoft.com | |||
| URI: http://www.suignard.com | URI: http://www.suignard.com | |||
| Full Copyright Statement | Full Copyright Statement | |||
| Copyright (C) The Internet Society (2003). All Rights Reserved. | Copyright (C) The Internet Society (2004). All Rights Reserved. | |||
| This document and translations of it may be copied and furnished to | This document and translations of it may be copied and furnished to | |||
| others, and derivative works that comment on or otherwise explain it | others, and derivative works that comment on or otherwise explain it | |||
| or assist in its implementation may be prepared, copied, published | or assist in its implementation may be prepared, copied, published | |||
| and distributed, in whole or in part, without restriction of any | and distributed, in whole or in part, without restriction of any | |||
| kind, provided that the above copyright notice and this paragraph are | kind, provided that the above copyright notice and this paragraph are | |||
| included on all such copies and derivative works. However, this | included on all such copies and derivative works. However, this | |||
| document itself may not be modified in any way, such as by removing | document itself may not be modified in any way, such as by removing | |||
| the copyright notice or references to the Internet Society or other | the copyright notice or references to the Internet Society or other | |||
| Internet organizations, except as needed for the purpose of | Internet organizations, except as needed for the purpose of | |||
| End of changes. | ||||
This html diff was produced by rfcdiff 1.12, available from http://www.levkowetz.com/ietf/tools/rfcdiff/ | ||||
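
The URI-to-IRI examples of Section 3.2.1 (in the draft-duerst-iri-06 column) can be approximated by the following minimal Python sketch. It is not taken from either draft. The function names, the FORBIDDEN set, and the ESCAPED_RUN pattern are illustrative assumptions. The sketch only unescapes non-ASCII octets, treats only the bidirectional formatting characters listed in Section 4.1 as forbidden (the draft also refers to other restrictions that are not modelled here), and does not attempt the optional ToUnicode conversion of punycode labels.

```python
import re

# Bidi formatting characters that Section 4.1 forbids inside an IRI:
# LRM, RLM, LRE, RLE, PDF, LRO, RLO.
FORBIDDEN = frozenset("\u200e\u200f\u202a\u202b\u202c\u202d\u202e")

# One or more consecutive percent-escapes of non-ASCII octets (%80 to %FF).
# Assumption: escapes of ASCII octets are left untouched by this sketch.
ESCAPED_RUN = re.compile(r"(?:%[89A-Fa-f][0-9A-Fa-f])+")


def percent_escape(octets):
    """Escape every octet as %HH with uppercase hex digits."""
    return "".join(f"%{b:02X}" for b in octets)


def _convert_run(match):
    # Turn the run of escapes into raw octets.
    octets = bytes.fromhex(match.group(0).replace("%", ""))
    # Octets that are not part of a strictly legal UTF-8 sequence come back
    # as lone surrogates U+DC80..U+DCFF, so they can be found and re-escaped.
    text = octets.decode("utf-8", errors="surrogateescape")
    out = []
    for ch in text:
        code = ord(ch)
        if 0xDC80 <= code <= 0xDCFF:
            # Not legal UTF-8: restore the original octet as an escape.
            out.append(percent_escape(bytes([code - 0xDC00])))
        elif ch in FORBIDDEN:
            # Legal UTF-8, but the character may not appear directly.
            out.append(percent_escape(ch.encode("utf-8")))
        else:
            out.append(ch)
    return "".join(out)


def uri_to_iri(uri):
    """Convert escaped UTF-8 in a URI to characters where that is allowed."""
    return ESCAPED_RUN.sub(_convert_run, uri)


if __name__ == "__main__":
    for example in (
        "http://www.example.org/D%C3%BCrst",
        "http://www.example.org/D%FCrst",
        "http://xn--99zt52a.example.org/%e2%80%ae",
    ):
        print(uri_to_iri(example))
```

For the three example URIs of Section 3.2.1 this yields the same final lines as the -06 column: the first becomes an IRI containing U+00FC, the second is returned unchanged, and the third keeps its punycode label while the escapes for U+202E come back in uppercase. Re-escaping always emits uppercase hex digits, which is why the case of letters in escapes is not preserved.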