Network Working Group                                        D. Connolly
Internet-Draft                             Midwest Web Sense LLC and W3C
Expires: November 22, 2009                        C. M. Sperberg-McQueen
                                             Black Mesa Technologies LLC
                                                            May 21, 2009


                        Web addresses in HTML 5
                         draft-connolly-href-00

Status of this Memo

   By submitting this Internet-Draft, each author represents that any
   applicable patent or other IPR claims of which he or she is aware
   have been or will be disclosed, and any of which he or she becomes
   aware will be disclosed, in accordance with Section 6 of BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups.  Note that
   other groups may also distribute working documents as
   Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt.

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html.

   This Internet-Draft will expire on November 22, 2009.

Connolly & M. Sperberg-McQueen  Expires November 22, 2009       [Page 1]
Internet-Draft           Web addresses in HTML 5                May 2009

Abstract

   This specification defines the handling of Web addresses for
   Hypertext Markup Language (HTML) 5, the fifth major revision of the
   core language of the World Wide Web.  In this version, special
   attention has been given to defining clear conformance criteria for
   user agents in an effort to improve interoperability.

1.  Introduction

   This specification defines the term Web address (Section 2), and
   defines various algorithms for dealing with Web addresses, because
   for historical reasons the rules defined by the URI and IRI
   specifications are not a complete description of what HTML user
   agents need to implement to be compatible with Web content.

2.  Terminology

   A _Web address_ is a string used to identify a resource.

   The term "Web address" in this specification is used to include not
   only Uniform Resource Identifiers (URIs) as they are defined by RFC
   3986 [ref-RFC3986] and Internationalized Resource Identifiers (IRIs)
   as they are defined by RFC 3987 [ref-RFC3987], but also other strings
   of characters which can be used to identify Web resources when
   processed appropriately.

   A Web address is a _valid Web address_ if at least one of the
   following conditions holds:

   o  The Web address is a valid URI reference (i.e. it matches the
      grammar for <URI-reference> given in RFC 3986 [ref-RFC3986]).

   o  The Web address is a valid IRI reference (i.e. it matches the
      grammar for <IRI-reference> given in RFC 3987 [ref-RFC3987]), and
      it has no query component.

   o  The Web address is a valid IRI reference and its query component
      contains no unescaped non-ASCII characters [ref-RFC3987].

   o  The Web address is a valid IRI reference and the character
      encoding of the Web address's "Document" is UTF-8 or UTF-16
      [ref-RFC3987].

   A Web address has an associated _URL character encoding_, determined
   as follows:

   If the Web address came from a script (e.g. as an argument to a
   method)
      The Web address character encoding is the script's character
      encoding.

   If the Web address came from a DOM node (e.g. from an element)
      The node has a "Document", and the URL character encoding is the
      document's character encoding.
   If the Web address had a character encoding defined when the Web
   address was created or defined
      The Web address character encoding is as defined.

3.  Parsing Web addresses

   To _parse a Web address_ _w_ into its component parts, the user agent
   must use the following steps:

   1.  Strip leading and trailing space characters from _w_.

   2.  Percent-encode all non-URI characters in _w_.

       Note: the 2nd step will replace all of the following characters
       with a percent-encoded equivalent:

       *  all characters with codepoints less than or equal to U+0020
          (i.e. the C0 control characters and the space character)

       *  all characters with codepoints greater than or equal to U+007F
          (i.e. U+007F and all non-ASCII characters in _w_)

       *  U+0022 double quotation mark

       *  U+0025 percent sign

       *  U+003C less-than sign

       *  U+003E greater-than sign

       *  U+005C reverse solidus (backslash)

       *  U+005E circumflex accent

       *  U+0060 grave accent

       *  U+007B left curly bracket

       *  U+007C vertical line

       *  U+007D right curly bracket

   3.  If _w_ begins with either of:

       *  a string matching the <scheme> production, followed by "://"

       *  the string "//"

       then percent-encode any left or right square brackets (U+005B,
       U+005D, "[" and "]") following the first occurrence of "/", "?",
       or "#" which _follows_ the first occurrence of "//".

   4.  Otherwise, percent-encode all left and right square brackets.

   5.  Percent-encode all occurrences of U+0023 (Number sign, "#") after
       the first.

   6.  Parse _w_ using the grammar in RFC 3986 [ref-RFC3986].

   7.  If _w_ doesn't match the <URI-reference> production, even after
       the above changes are made to it, then parsing the Web address
       fails with an error.  [ref-RFC3986]

   8.  Otherwise, parsing _w_ was successful; the components of the Web
       address are substrings of _w_ defined as follows.
       First, the substring of the modified _w_ which matched a
       particular production in RFC 3986 [ref-RFC3986] is identified;
       then any percent-encoded characters in that substring are
       decoded.  The resulting string (called here the "decoded
       substring") is one of the named components of _w_.  As a result
       of percent-encoding the percent sign, any occurrences of
       percent-encoding in the Web address will be double-encoded at
       this step.

       <scheme>
          The decoded substring matched by the <scheme> production, if
          any.

       <host>
          The decoded substring matched by the <host> production, if
          any.

       <port>
          The decoded substring matched by the <port> production, if
          any.

       <hostport>
          If there is a <scheme> component and a <port> component and
          the port given by the <port> component is different from the
          default port defined for the protocol given by the <scheme>
          component, then <hostport> is the decoded substring that
          starts with the decoded substring matched by the <host>
          production and ends with the decoded substring matched by the
          <port> production, and includes the colon in between the two.
          Otherwise, it is the same as the <host> component.

       <path>
          The decoded substring matched by one of the following
          productions, if one of them was matched:

          +  <path-abempty>

          +  <path-absolute>

          +  <path-noscheme>

          +  <path-rootless>

          +  <path-empty>

       <query>
          The decoded substring matched by the <query> production, if
          any.

       <fragment>
          The decoded substring matched by the <fragment> production, if
          any.

       <host-specific>
          The decoded substring that _follows_ the decoded substring
          matched by the <authority> production, or the whole string if
          the <authority> production wasn't matched.

4.  Resolving Web addresses

   To _resolve a Web address_ to an absolute Web address relative to
   either another absolute Web address or an element, the user agent
   must use the following steps.  Resolving a Web address can result in
   an error, in which case the Web address is not resolvable.

   1.  Let _w_ be the Web address being resolved.

   2.  Let _encoding_ be the character encoding of the Web address.

   3.
       If _encoding_ is UTF-16, then change it to UTF-8.

   4.  If the algorithm was invoked with an absolute Web address to use
       as the base Web address, let _base_ be that absolute Web address.

   5.  Otherwise, let _base_ be the _base URI of the element_, as
       defined by the XML Base specification, with _the base URI of the
       document entity_ being defined as the document base Web address
       of the "Document" that owns the element.  [ref-XMLBase]

   6.  For the purposes of the XML Base specification, user agents must
       act as if all "Document" objects represented XML documents.

   7.  It is possible for "xml:base" attributes to be present even in
       HTML fragments, as such attributes can be added dynamically using
       script.  (Such scripts would not be conforming, however, as
       "xml:base" attributes are not allowed in HTML documents.)

   8.  The _document base Web address_ of a "Document" is the absolute
       Web address obtained by running these substeps:

       1.  Let _fallback base url_ be the document's address.

       2.  If _fallback base url_ is "about:blank", and the "Document"'s
           browsing context has a creator browsing context, then let
           _fallback base url_ be the document base Web address of the
           creator "Document" instead.

       3.  If there is no "base" element that is both a child of the
           "head" element and has an "href" attribute, then the document
           base Web address is _fallback base url_.

       4.  Otherwise, let _w_ be the value of the "href" attribute of
           the first such element.

       5.  Resolve _w_ relative to _fallback base url_ (thus, the "base"
           "href" attribute isn't affected by "xml:base" attributes).

       6.  The document base Web address is the result of the previous
           step if it was successful; otherwise it is _fallback base
           url_.

   9.  Parse _w_ into its component parts.

   10.
       If parsing _w_ resulted in a <host> component, then replace the
       matching substring of _w_ with the string that results from
       expanding any sequences of percent-encoded octets in that
       component that are valid UTF-8 sequences into Unicode characters
       as defined by UTF-8.

   11.  If any percent-encoded octets in that component are not valid
        UTF-8 sequences, then return an error and abort these steps.

   12.  Apply the IDNA ToASCII algorithm to the matching substring, with
        both the AllowUnassigned and UseSTD3ASCIIRules flags set.
        Replace the matching substring with the result of the ToASCII
        algorithm.

   13.  If ToASCII fails to convert one of the components of the string,
        e.g. because it is too long or because it contains invalid
        characters, then return an error and abort these steps
        [ref-RFC3490].

   14.  If parsing _w_ resulted in a <path> component, then replace the
        matching substring of _w_ with the string that results from
        applying the following steps to each character other than U+0025
        PERCENT SIGN (%) that doesn't match the original <path>
        production defined in RFC 3986:

        1.  Encode the character into a sequence of octets as defined by
            UTF-8.

        2.  Replace the character with the percent-encoded form of those
            octets.  [ref-RFC3986]

        For instance, if _w_ was "//example.com/a^b☺c%FFd%z/?e", then
        the <path> component's substring would be "/a^b☺c%FFd%z/" and
        the two characters that would have to be escaped would be "^"
        and "☺".  After this step is applied, _w_ would therefore have
        the value "//example.com/a%5Eb%E2%98%BAc%FFd%z/?e".

   15.  If parsing _w_ resulted in a <query> component, then replace the
        matching substring of _w_ with the string that results from
        applying the following steps to each character other than U+0025
        PERCENT SIGN (%) that doesn't match the original <query>
        production defined in RFC 3986:

        1.
            If the character in question cannot be expressed in the
            encoding _encoding_, then replace it with a single 0x3F
            octet (an ASCII question mark) and skip the remaining
            substeps for this character.

        2.  Encode the character into a sequence of octets as defined by
            the encoding _encoding_.

        3.  Replace the character with the percent-encoded form of those
            octets.  [ref-RFC3986]

   16.  Apply the algorithm described in RFC 3986 section 5.2 Relative
        Resolution, using _w_ as the potentially relative URI reference
        (_R_), and _base_ as the base URI (_Base_).  [ref-RFC3986]

   17.  Apply any relevant conformance criteria of RFC 3986 and RFC
        3987, returning an error and aborting these steps if
        appropriate.  [ref-RFC3986] [ref-RFC3987]

   18.  For instance, if an absolute URI that would be returned by the
        above algorithm violates the restrictions specific to its
        scheme, e.g. a "data:" URI using the "//" server-based naming
        authority syntax, then user agents are to treat this as an error
        instead.

   19.  Let _result_ be the target URI (_T_) returned by the Relative
        Resolution algorithm.

   20.  If _result_ uses a scheme with a server-based naming authority,
        replace all U+005C REVERSE SOLIDUS (\) characters in _result_
        with U+002F SOLIDUS (/) characters.

   21.  Return _result_.

   A Web address (Section 2) is an _absolute Web address_ if resolving
   it results in the same Web address without an error.

5.  References

   [ref-RFC3490]  "Internationalizing Domain Names in Applications
                  (IDNA)", RFC 3490, March 2003.

   [ref-RFC3986]  "Uniform Resource Identifier (URI): Generic Syntax",
                  RFC 3986, January 2005.

   [ref-RFC3987]  "Internationalized Resource Identifiers (IRIs)",
                  RFC 3987, January 2005.

   [ref-XMLBase]  "XML Base (Second Edition)".

Authors' Addresses

   Dan Connolly
   Midwest Web Sense LLC and W3C

   Email: connolly@w3.org
   URI:   http://www.w3.org/People/Connolly/

   C. M. Sperberg-McQueen
   Black Mesa Technologies LLC

   Email: cmsmcq@blackmesatech.com
   URI:   http://www.blackmesatech.com/who/cmsmcq/

Full Copyright Statement

   Copyright (C) The IETF Trust (2009).

   This document is subject to the rights, licenses and restrictions
   contained in BCP 78, and except as set forth therein, the authors
   retain all their rights.

   This document and the information contained herein are provided on an
   "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS
   OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY, THE IETF TRUST AND
   THE INTERNET ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS
   OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF
   THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
   WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

Intellectual Property

   The IETF takes no position regarding the validity or scope of any
   Intellectual Property Rights or other rights that might be claimed to
   pertain to the implementation or use of the technology described in
   this document or the extent to which any license under such rights
   might or might not be available; nor does it represent that it has
   made any independent effort to identify any such rights.  Information
   on the procedures with respect to rights in RFC documents can be
   found in BCP 78 and BCP 79.
   Copies of IPR disclosures made to the IETF Secretariat and any
   assurances of licenses to be made available, or the result of an
   attempt made to obtain a general license or permission for the use of
   such proprietary rights by implementers or users of this
   specification can be obtained from the IETF on-line IPR repository at
   http://www.ietf.org/ipr.

   The IETF invites any interested party to bring to its attention any
   copyrights, patents or patent applications, or other proprietary
   rights that may cover technology that may be required to implement
   this standard.  Please address the information to the IETF at
   ietf-ipr@ietf.org.
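
   As an informal illustration of Section 3, the pre-processing in steps
   1, 2 and 5 of the parsing algorithm might be sketched in Python as
   follows.  This is a sketch under stated assumptions, not part of the
   specification.  The names SPACE_CHARACTERS, LISTED, needs_encoding,
   percent_encode and preprocess are hypothetical.  The set of "space
   characters" and the use of UTF-8 to obtain octets in step 2 are
   assumptions that this document does not state.  The square-bracket
   handling of steps 3 and 4 is omitted.

```python
# Assumption: "space characters" means this set; Section 3 does not
# enumerate them.
SPACE_CHARACTERS = " \t\n\f\r"

# Code points named individually in the note to step 2:
# " % < > \ ^ ` { | }
LISTED = {0x22, 0x25, 0x3C, 0x3E, 0x5C, 0x5E, 0x60, 0x7B, 0x7C, 0x7D}


def needs_encoding(ch: str) -> bool:
    """True if the note to step 2 says this character is replaced."""
    cp = ord(ch)
    return cp <= 0x20 or cp >= 0x7F or cp in LISTED


def percent_encode(ch: str) -> str:
    """Percent-encode one character.

    Assumption: octets are taken from UTF-8; step 2 names no encoding.
    """
    return "".join(f"%{octet:02X}" for octet in ch.encode("utf-8"))


def preprocess(w: str) -> str:
    """Apply steps 1, 2 and 5 of Section 3 (steps 3 and 4 omitted)."""
    # Step 1: strip leading and trailing space characters.
    w = w.strip(SPACE_CHARACTERS)
    # Step 2: percent-encode every character covered by the note.
    w = "".join(percent_encode(c) if needs_encoding(c) else c for c in w)
    # Step 5: percent-encode every "#" after the first one.
    head, sep, tail = w.partition("#")
    return head + sep + tail.replace("#", "%23")
```

   With these definitions, preprocess("  /a^b c#x#y \n") yields
   "/a%5Eb%20c#x%23y".  An existing escape such as "%20" becomes
   "%2520", which corresponds to the double-encoding remarked on in
   step 8 of Section 3.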