Web addresses in HTML 5
This memo provides information for the Internet community. It does not specify an Internet standard of any kind. Distribution of this memo is unlimited.
Copyright © The Internet Society (2009). All Rights Reserved.
This specification defines the handling of Web addresses for Hypertext Markup Language (HTML) 5, the fifth major revision of the core language of the World Wide Web. In this version, special attention has been given to defining clear conformance criteria for user agents in an effort to improve interoperability.
This specification defines the term Web address, and defines various algorithms for dealing with Web addresses, because for historical reasons the rules defined by the URI and IRI specifications are not a complete description of what HTML user agents need to implement to be compatible with Web content.
A Web address is a string used to identify a resource.
The term "Web address" in this specification is used to include not only Uniform Resource Identifiers (URIs) as they are defined by RFC 3986 [ref-RFC3986] and Internationalized Resource Identifiers (IRIs) as they are defined by RFC 3987 [ref-RFC3987], but also other strings of characters which can be used to identify Web resources when processed appropriately.
A Web address is a valid Web address if at least one of the following conditions holds:
- The Web address is a valid URI reference (i.e. it matches the grammar for <URI-reference> given in RFC 3986 [ref-RFC3986]).
- The Web address is a valid IRI reference (i.e. it matches the grammar for <IRI-reference> given in RFC 3987 [ref-RFC3987]), and it has no query component.
- The Web address is a valid IRI reference and its query component contains no unescaped non-ASCII characters [RFC3987] [ref-RFC3987].
- The Web address is a valid IRI reference and the character encoding of the Web address's Document is UTF-8 or UTF-16 [RFC3987] [ref-RFC3987].
A Web address has an associated URL character encoding, determined as follows:
- If the Web address came from a script (e.g. as an argument to a method)
- The URL character encoding is the script's character encoding.
- If the Web address came from a DOM node (e.g. from an element)
- The node has a Document, and the URL character encoding is the document's character encoding.
- If the Web address had a character encoding defined when the Web address was created or defined
- The URL character encoding is as defined.
To parse a Web address w into its component parts, the user agent must use the following steps:
- Strip leading and trailing space characters from w.
- Percent-encode all non-URI characters in w. Note: this step replaces each of the following characters with its percent-encoded equivalent:
- all characters with codepoints less than or equal to U+0020 (i.e. the C0 control characters and U+0020 SPACE)
- all characters with codepoints greater than or equal to U+007F (i.e. U+007F and all non-ASCII characters in w)
- U+0022 double quotation mark
- U+0025 percent sign
- U+003C less-than sign
- U+003E greater-than sign
- U+005C reverse solidus (backslash)
- U+005E circumflex accent
- U+0060 grave accent
- U+007B left curly bracket
- U+007C vertical line
- U+007D right curly bracket
- If w begins with either of:
- a string matching the <scheme> production, followed by "://"
- the string "//"
then percent-encode any left or right square brackets (U+005B, U+005D, "[" and "]") following the first occurrence of "/", "?", or "#" which follows the first occurrence of "//".
- Otherwise, percent-encode all left and right square brackets.
- Percent-encode all occurrences of U+0023 (Number sign, "#") after the first.
- Parse w using the grammar in RFC 3986 [ref-RFC3986].
- If w doesn't match the <URI-reference> production, even after the above changes are made to it, then parsing the Web address fails with an error. [RFC3986] [ref-RFC3986]
- Otherwise, parsing w was successful; the components of the Web address are substrings of w defined as follows. First, the substring of the modified w which matched a particular production in RFC 3986 [ref-RFC3986] is identified; then any percent-encoded characters in that substring are decoded. The resulting string (called here the "decoded substring") is one of the named components of w. As a result of percent-encoding the percent sign, any occurrences of percent-encoding in the Web address will be double-encoded at this step.
- <scheme>
- The decoded substring matched by the <scheme> production, if any.
- <host>
- The decoded substring matched by the <host> production, if any.
- <port>
- The decoded substring matched by the <port> production, if any.
- <hostport>
- If there is a <scheme> component and a <port> component and the port given by the <port> component is different than the default port defined for the protocol given by the <scheme> component, then <hostport> is the decoded substring that starts with the decoded substring matched by the <host> production and ends with the decoded substring matched by the <port> production, and includes the colon in between the two. Otherwise, it is the same as the <host> component.
- <path>
- The decoded substring matched by one of the following productions, if one of them was matched:
- <path-abempty>
- <path-absolute>
- <path-noscheme>
- <path-rootless>
- <path-empty>
- <query>
- The decoded substring matched by the <query> production, if any.
- <fragment>
- The decoded substring matched by the <fragment> production, if any.
- <host-specific>
- The decoded substring that follows the decoded substring matched by the <authority> production, or the whole string if the <authority> production wasn't matched.
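As a rough illustration, the rewriting steps that precede the RFC 3986 parse in the algorithm above (stripping, percent-encoding non-URI characters, square brackets, and extra number signs) might be sketched in Python as follows. The function name preprocess, the set of characters stripped in the first step (taken here to be HTML 5's space characters), and the use of UTF-8 when percent-encoding non-ASCII characters are assumptions of this sketch, not requirements stated in this specification.

```python
import re

# Assumption: "space characters" means space, tab, LF, FF and CR.
SPACE_CHARS = " \t\n\x0c\r"
# Individual characters listed in the note, besides the two code point ranges.
EXTRA_NON_URI = set('"%<>\\^`{|}')
# A <scheme> (ALPHA followed by ALPHA / DIGIT / "+" / "-" / ".") then "://".
SCHEME_PREFIX = re.compile(r"[A-Za-z][A-Za-z0-9+.\-]*://")


def pct(ch: str) -> str:
    """Percent-encode one character via its UTF-8 octets (an assumption)."""
    return "".join(f"%{b:02X}" for b in ch.encode("utf-8"))


def preprocess(w: str) -> str:
    # Strip leading and trailing space characters.
    w = w.strip(SPACE_CHARS)
    # Percent-encode the non-URI characters, including "%" itself.
    w = "".join(
        pct(c) if ord(c) <= 0x20 or ord(c) >= 0x7F or c in EXTRA_NON_URI else c
        for c in w
    )
    # Square brackets are kept only in the authority part, if there is one.
    if w.startswith("//") or SCHEME_PREFIX.match(w):
        start = w.index("//") + 2
        m = re.search(r"[/?#]", w[start:])
        cut = start + m.start() if m else len(w)
    else:
        cut = 0
    w = w[:cut] + w[cut:].replace("[", "%5B").replace("]", "%5D")
    # Percent-encode every "#" after the first.
    first, sep, rest = w.partition("#")
    return first + sep + rest.replace("#", "%23")
```

For example, preprocess("a b%20☺") gives "a%20b%2520%E2%98%BA": the existing "%20" becomes "%2520", which is the double-encoding that the later decoding of each component undoes.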
To resolve a Web address to an absolute Web address relative to either another absolute Web address or an element, the user agent must use the following steps. Resolving a Web address can result in an error, in which case the Web address is not resolvable.
- Let w be the Web address being resolved.
- Let encoding be the character encoding of the Web address.
- If encoding is UTF-16, then change it to UTF-8.
- If the algorithm was invoked with an absolute Web address to use as the base Web address, let base be that absolute Web address.
- Otherwise, let base be the base URI of the element, as defined by the XML Base specification, with the base URI of the document entity being defined as the document base Web address of the Document that owns the element. [XMLBASE]
- For the purposes of the XML Base specification, user agents must act as if all Document objects represented XML documents.
- It is possible for xml:base attributes to be present even in HTML fragments, as such attributes can be added dynamically using script. (Such scripts would not be conforming, however, as xml:base attributes are not allowed in HTML documents.)
- The document base Web address of a Document is the absolute Web address obtained by running these substeps:
- Let fallback base url be the document's address.
- If fallback base url is about:blank, and the Document's browsing context has a creator browsing context, then let fallback base url be the document base Web address of the creator Document instead.
- If there is no base element that is both a child of the head element and has an href attribute, then the document base Web address is fallback base url.
- Otherwise, let w be the value of the href attribute of the first such element.
- Resolve w relative to fallback base url (thus, the base href attribute isn't affected by xml:base attributes).
- The document base Web address is the result of the previous step if it was successful; otherwise it is fallback base url.
- Parse w into its component parts.
- If parsing w resulted in a <host> component, then replace the matching substring of w with the string that results from expanding any sequences of percent-encoded octets in that component that are valid UTF-8 sequences into Unicode characters as defined by UTF-8.
- If any percent-encoded octets in that component are not valid UTF-8 sequences, then return an error and abort these steps.
- Apply the IDNA ToASCII algorithm to the matching substring, with both the AllowUnassigned and UseSTD3ASCIIRules flags set. Replace the matching substring with the result of the ToASCII algorithm.
- If ToASCII fails to convert one of the components of the string, e.g. because it is too long or because it contains invalid characters, then return an error and abort these steps [RFC3490] [ref-RFC3490].
- If parsing w resulted in a <path> component, then replace the matching substring of w with the string that results from applying the following steps to each character other than U+0025 PERCENT SIGN (%) that doesn't match the original <path> production defined in RFC 3986:
- Encode the character into a sequence of octets as defined by UTF-8.
- Replace the character with the percent-encoded form of those octets. [RFC3986] [ref-RFC3986]
For instance, if w was "//example.com/a^b☺c%FFd%z/?e", then the <path> component's substring would be "/a^b☺c%FFd%z/" and the two characters that would have to be escaped would be "^" and "☺". The result after this step was applied would therefore be that w now had the value "//example.com/a%5Eb%E2%98%BAc%FFd%z/?e".
- If parsing w resulted in a <query> component, then replace the matching substring of w with the string that results from applying the following steps to each character other than U+0025 PERCENT SIGN (%) that doesn't match the original <query> production defined in RFC 3986:
- If the character in question cannot be expressed in the character encoding given by encoding, then replace it with a single 0x3F octet (an ASCII question mark) and skip the remaining substeps for this character.
- Encode the character into a sequence of octets as defined by the character encoding given by encoding.
- Replace the character with the percent-encoded form of those octets. [RFC3986] [ref-RFC3986]
- Apply the algorithm described in RFC 3986 section 5.2 Relative Resolution, using w as the potentially relative URI reference (R), and base as the base URI (Base). [RFC3986] [ref-RFC3986]
- Apply any relevant conformance criteria of RFC 3986 and RFC 3987, returning an error and aborting these steps if appropriate. [RFC3986] [ref-RFC3986] [RFC3987] [ref-RFC3987]
- For instance, if an absolute URI that would be returned by the above algorithm violates the restrictions specific to its scheme, e.g. a data: URI using the "//" server-based naming authority syntax, then user agents are to treat this as an error instead.
- Let result be the target URI (T) returned by the Relative Resolution algorithm.
- If result uses a scheme with a server-based naming authority, replace all U+005C REVERSE SOLIDUS (\) characters in result with U+002F SOLIDUS (/) characters.
- Return result.
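The path and query escaping steps of the algorithm above can be sketched in Python as below. This is a minimal sketch rather than a complete resolver: the names escape_component, escape_path and escape_query are hypothetical, the allowed-character sets are transcribed from the <unreserved>, <sub-delims>, <pchar> and <query> productions of RFC 3986, and the caller is assumed to have already replaced UTF-16 with UTF-8 as the earlier step requires.

```python
from string import ascii_letters, digits

# Characters allowed literally by RFC 3986: unreserved and sub-delims.
UNRESERVED = set(ascii_letters + digits + "-._~")
SUB_DELIMS = set("!$&'()*+,;=")
# <path> is built from pchar (unreserved, sub-delims, ":", "@") and "/".
PATH_CHARS = UNRESERVED | SUB_DELIMS | set(":@/")
# <query> additionally allows "?".
QUERY_CHARS = PATH_CHARS | {"?"}


def escape_component(text: str, allowed: set, encoding: str) -> str:
    out = []
    for ch in text:
        if ch == "%" or ch in allowed:
            out.append(ch)  # left untouched, including stray "%" signs
            continue
        try:
            octets = ch.encode(encoding)
        except UnicodeEncodeError:
            out.append("?")  # not expressible: a single 0x3F octet
            continue
        out.extend(f"%{b:02X}" for b in octets)
    return "".join(out)


def escape_path(path: str) -> str:
    # Paths are always escaped through UTF-8.
    return escape_component(path, PATH_CHARS, "utf-8")


def escape_query(query: str, encoding: str) -> str:
    # Queries are escaped through the URL character encoding.
    return escape_component(query, QUERY_CHARS, encoding)
```

With these definitions, escape_path("/a^b☺c%FFd%z/") returns "/a%5Eb%E2%98%BAc%FFd%z/", matching the example given in the path step, while escape_query("e=☺", "iso-8859-1") returns "e=?" because "☺" cannot be expressed in that encoding.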
A Web address is an absolute Web address if resolving it results in the same Web address without an error.
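The definition above amounts to a fixed-point test: resolve the address and compare the result with the input. The sketch below illustrates only that idea. It uses Python's urllib.parse.urljoin as a stand-in for the resolve algorithm, which is an assumption of this example: urljoin performs relative resolution but none of the host, path, or query rewriting defined in this specification, and it may normalize some inputs (for instance the case of the scheme). The function name is_absolute and the placeholder base address are likewise hypothetical.

```python
from urllib.parse import urljoin

# Placeholder base; any absolute address would do for the comparison.
PLACEHOLDER_BASE = "http://base.invalid/dir/"


def is_absolute(w: str, base: str = PLACEHOLDER_BASE) -> bool:
    """Return True if resolving w against base yields w unchanged."""
    return urljoin(base, w) == w
```

Under these assumptions, is_absolute("http://example.com/a") is true, while is_absolute("../a") and is_absolute("//example.com/a") are false, since resolution changes both of them.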
To parse a Web address w into its component parts, the user agent must use the following steps:
- Strip leading and trailing space characters from w.
- Parse w in the manner defined by RFC 3986, with the following exceptions:
- Add all characters with codepoints less than or equal to U+0020 or greater than or equal to U+007F to the <unreserved> production.
- Add the characters U+0022, U+003C, U+003E, U+005B .. U+005E, U+0060, and U+007B .. U+007D to the <unreserved> production.
- Add a single U+0025 PERCENT SIGN character as a second alternative way of matching the <pct-encoded> production, except when the <pct-encoded> is used in the <reg-name> production.
- Add the U+0023 NUMBER SIGN character to the characters allowed in the <fragment> production.
- If w doesn't match the <URI-reference> production, even after the above changes are made to the ABNF definitions, then parsing the Web address fails with an error. [RFC3986] [ref-RFC3986]
- Otherwise, parsing w was successful; the components of the Web address are substrings of w defined as follows:
- <scheme>
- The substring matched by the <scheme> production, if any.
- <host>
- The substring matched by the <host> production, if any.
- <port>
- The substring matched by the <port> production, if any.
- <hostport>
- If there is a <scheme> component and a <port> component and the port given by the <port> component is different than the default port defined for the protocol given by the <scheme> component, then <hostport> is the substring that starts with the substring matched by the <host> production and ends with the substring matched by the <port> production, and includes the colon in between the two. Otherwise, it is the same as the <host> component.
- <path>
- The substring matched by one of the following productions, if one of them was matched:
- <path-abempty>
- <path-absolute>
- <path-noscheme>
- <path-rootless>
- <path-empty>
- <query>
- The substring matched by the <query> production, if any.
- <fragment>
- The substring matched by the <fragment> production, if any.
- <host-specific>
- The substring that follows the substring matched by the <authority> production, or the whole string if the <authority> production wasn't matched.
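The component definitions above, in particular <hostport> and <host-specific>, can be illustrated with a short Python sketch. It is a simplification under stated assumptions: it splits w with the regular expression given in Appendix B of RFC 3986 instead of the modified ABNF, so it does not detect parse failures; the function name split_components and the small default-port table are hypothetical and cover only a few schemes.

```python
import re

# Component-splitting expression from RFC 3986, Appendix B. It accepts any
# string, so it splits components but does not validate them.
URI_RE = re.compile(
    r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?", re.DOTALL
)
# host is a bracketed IP literal or a run without ":"; port is ASCII digits.
HOSTPORT_RE = re.compile(r"^(\[[^\]]*\]|[^:]*)(?::([0-9]*))?$")
# Illustrative subset of default ports (an assumption of this sketch).
DEFAULT_PORTS = {"http": 80, "https": 443, "ftp": 21}


def split_components(w: str) -> dict:
    m = URI_RE.match(w)
    scheme, authority, path, query, fragment = m.group(2, 4, 5, 7, 9)
    host = port = hostport = None
    if authority is not None:
        hostpart = authority.rpartition("@")[2]  # drop any userinfo
        hp = HOSTPORT_RE.match(hostpart)
        host, port = (hp.group(1), hp.group(2)) if hp else (hostpart, None)
        # <hostport> includes the port only when it differs from the default.
        hostport = host
        if scheme and port and int(port) != DEFAULT_PORTS.get(scheme.lower()):
            hostport = f"{host}:{port}"
    # <host-specific>: everything after the authority, or the whole string.
    host_specific = w[m.end(3):] if authority is not None else w
    return {
        "scheme": scheme, "host": host, "port": port, "hostport": hostport,
        "path": path, "query": query, "fragment": fragment,
        "host-specific": host_specific,
    }
```

For "http://example.com:8080/a/b?x=1#frag" this yields a <hostport> of "example.com:8080" and a <host-specific> of "/a/b?x=1#frag"; for "http://example.com:80/" the <hostport> is just "example.com", because 80 is the default port for http.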
6. References
[ref-RFC3490] “Internationalizing Domain Names in Applications (IDNA)”, RFC 3490, March 2003, <http://www.ietf.org/rfc/rfc3490.txt>.
[ref-RFC3986] “Uniform Resource Identifier (URI): Generic Syntax”, RFC 3986, January 2005, <http://www.ietf.org/rfc/rfc3986.txt>.
[ref-RFC3987] “Internationalized Resource Identifiers (IRIs)”, RFC 3987, January 2005, <http://www.ietf.org/rfc/rfc3987.txt>.
[ref-XMLBase] “XML Base (Second Edition)”, W3C Recommendation 28 January 2009, <http://www.w3.org/TR/xmlbase/>.
Dan Connolly
Midwest Web Sense LLC and W3C
EMail: connolly@w3.org
URI: http://www.w3.org/People/Connolly/
C. M. Sperberg-McQueen
Black Mesa Technologies LLC
EMail: cmsmcq@blackmesatech.com
URI: http://www.blackmesatech.com/who/cmsmcq/
Copyright © The Internet Society (2009). All Rights Reserved.
This document and translations of it may be copied and furnished to others, and derivative works that comment on or otherwise explain it or assist in its implementation may be prepared, copied, published and distributed, in whole or in part, without restriction of any kind, provided that the above copyright notice and this paragraph are included on all such copies and derivative works. However, this document itself may not be modified in any way, such as by removing the copyright notice or references to the Internet Society or other Internet organizations, except as needed for the purpose of developing Internet standards in which case the procedures for copyrights defined in the Internet Standards process must be followed, or as required to translate it into languages other than English.
The limited permissions granted above are perpetual and will not be revoked by the Internet Society or its successors or assignees.
This document and the information contained herein is provided on an “AS IS” basis and THE INTERNET SOCIETY AND THE INTERNET ENGINEERING TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.
The IETF takes no position regarding the validity or scope of any intellectual property or other rights that might be claimed to pertain to the implementation or use of the technology described in this document or the extent to which any license under such rights might or might not be available; neither does it represent that it has made any effort to identify any such rights. Information on the IETF's procedures with respect to rights in standards-track and standards-related documentation can be found in BCP-11. Copies of claims of rights made available for publication and any assurances of licenses to be made available, or the result of an attempt made to obtain a general license or permission for the use of such proprietary rights by implementors or users of this specification can be obtained from the IETF Secretariat.
The IETF invites any interested party to bring to its attention any copyrights, patents or patent applications, or other proprietary rights which may cover technology that may be required to practice this standard. Please address the information to the IETF Executive Director.