This is revision 1.5612.
This specification defines the term URL, and defines various algorithms for dealing with URLs, because for historical reasons the rules defined by the URI and IRI specifications are not a complete description of what HTML user agents need to implement to be compatible with Web content.
The term "URL" in this specification is used in a manner distinct from the precise technical meaning it is given in RFC 3986. Readers familiar with that RFC will find it easier to read this specification if they pretend the term "URL" as used herein is really called something else altogether. This is a willful violation of RFC 3986. [RFC3986]
A URL is a string used to identify a resource.
A URL is a valid URL if at least one of the following conditions holds:
A string is a valid non-empty URL if it is a valid URL but it is not the empty string.
A string is a valid URL potentially surrounded by spaces if, after stripping leading and trailing whitespace from it, it is a valid URL.
A string is a valid non-empty URL potentially surrounded by spaces if, after stripping leading and trailing whitespace from it, it is a valid non-empty URL.
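The three derived definitions above can be sketched in Python. This is an illustrative sketch only: the validity conditions themselves are not reproduced here, so the `is_valid_url` predicate is a hypothetical parameter, and the whitespace set is an assumption (tab, LF, FF, CR, space).

```python
from typing import Callable

# Assumption: the characters stripped as "whitespace" are tab, LF, FF, CR and space.
SPACE_CHARACTERS = "\t\n\x0c\r "


def is_valid_non_empty_url(s: str, is_valid_url: Callable[[str], bool]) -> bool:
    # A valid URL that is not the empty string.
    return s != "" and is_valid_url(s)


def is_valid_url_potentially_surrounded_by_spaces(
    s: str, is_valid_url: Callable[[str], bool]
) -> bool:
    # Strip leading and trailing whitespace, then apply the validity check.
    return is_valid_url(s.strip(SPACE_CHARACTERS))


def is_valid_non_empty_url_potentially_surrounded_by_spaces(
    s: str, is_valid_url: Callable[[str], bool]
) -> bool:
    # Strip first, then require a valid, non-empty URL.
    return is_valid_non_empty_url(s.strip(SPACE_CHARACTERS), is_valid_url)
```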
This specification defines the URL about:legacy-compat as a reserved, though unresolvable, about: URI, for use in DOCTYPEs in HTML documents when needed for compatibility with XML tools. [ABOUT]
This specification defines the URL about:srcdoc as a reserved, though unresolvable, about: URI, that is used as the document's address of iframe srcdoc documents. [ABOUT]
To parse a URL url into its component parts, the user agent must use the following steps:
Strip leading and trailing whitespace from url.
Parse url in the manner defined by RFC 3986, with the following exceptions:
If url doesn't match the <URI-reference> production, even after the above changes are made to the ABNF definitions, then parsing the URL fails with an error. [RFC3986]
Otherwise, parsing url was successful; the components of the URL are substrings of url defined as follows:
<scheme>
The substring matched by the <scheme> production, if any.
<host>
The substring matched by the <host> production, if any.
<port>
The substring matched by the <port> production, if any.
<hostport>
If there is a <scheme> component and a <port> component and the port given by the <port> component is different than the default port defined for the protocol given by the <scheme> component, then <hostport> is the substring that starts with the substring matched by the <host> production and ends with the substring matched by the <port> production, and includes the colon in between the two. Otherwise, it is the same as the <host> component.
<path>
The substring matched by one of the following productions, if one of them was matched:
<query>
The substring matched by the <query> production, if any.
<fragment>
The substring matched by the <fragment> production, if any.
<host-specific>
The substring that follows the substring matched by the <authority> production, or the whole string if the <authority> production wasn't matched.
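As an illustration of the <hostport> rule above, here is a minimal Python sketch. The function name and the default-port table are assumptions made for the example; the table is not an exhaustive list of schemes, and an empty <port> component is treated here as if no port were given.

```python
from typing import Optional

# Assumption: a small illustrative table of default ports, not a complete registry.
DEFAULT_PORTS = {"http": 80, "https": 443, "ftp": 21}


def hostport(scheme: Optional[str], host: str, port: Optional[str]) -> str:
    # The port is included only when a scheme and a port are both present
    # and the port differs from the scheme's default port.
    if scheme is not None and port:
        default = DEFAULT_PORTS.get(scheme.lower())
        if default is None or int(port) != default:
            return f"{host}:{port}"
    # Otherwise <hostport> is the same as the <host> component.
    return host
```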
Resolving a URL is the process of taking a relative URL and obtaining the absolute URL that it implies.
To resolve a URL to an absolute URL relative to either another absolute URL or an element, the user agent must use the following steps. Resolving a URL can result in an error, in which case the URL is not resolvable.
Let url be the URL being resolved.
Let encoding be determined as follows:
Document, and the URL character encoding is the document's character encoding.
If encoding is a UTF-16 encoding, then change the value of encoding to UTF-8.
Otherwise, let base be the base URI of the element, as defined by the XML Base specification, with the base URI of the document entity being defined as the document base URL of the Document that owns the element. [XMLBASE]
For the purposes of the XML Base specification, user agents must act as if all Document objects represented XML documents.
It is possible for xml:base attributes to be present even in HTML fragments, as such attributes can be added dynamically using script. (Such scripts would not be conforming, however, as xml:base attributes are not allowed in HTML documents.)
Let fallback base url be the document's address.
If there is no base element that has an href attribute, then the document base URL is fallback base url; abort these steps. Otherwise, let url be the value of the href attribute of the first such element.
The document base URL is the result of the previous step if it was successful; otherwise it is fallback base url.
Parse url into its component parts.
If parsing url resulted in a <host> component, then replace the matching substring of url with the string that results from expanding any sequences of percent-encoded octets in that component that are valid UTF-8 sequences into Unicode characters as defined by UTF-8.
If any percent-encoded octets in that component are not valid UTF-8 sequences (e.g. sequences of percent-encoded octets that expand to surrogate code points), then return an error and abort these steps.
Apply the IDNA ToASCII algorithm to the matching substring, with both the AllowUnassigned and UseSTD3ASCIIRules flags set. Replace the matching substring with the result of the ToASCII algorithm.
If ToASCII fails to convert one of the components of the string, e.g. because it is too long or because it contains invalid characters, then return an error and abort these steps. [RFC3490]
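The <host> processing steps above can be sketched in Python. This is a minimal sketch under stated assumptions, not a definitive implementation: the function name `process_host` is hypothetical, and Python's built-in `idna` codec (which implements RFC 3490 ToASCII) does not expose the AllowUnassigned and UseSTD3ASCIIRules flags, so its behavior may differ from the steps above in edge cases.

```python
from urllib.parse import unquote_to_bytes


def process_host(host: str) -> str:
    """Expand percent-encoded UTF-8 in a host, then apply IDNA ToASCII."""
    try:
        # Expand percent-encoded octets; strict decoding rejects octet
        # sequences that are not valid UTF-8 (including encoded surrogates).
        expanded = unquote_to_bytes(host).decode("utf-8")
        # Python's "idna" codec applies ToASCII to each label.
        return expanded.encode("idna").decode("ascii")
    except UnicodeError as exc:
        # Either step failing makes the URL unresolvable.
        raise ValueError(f"URL is not resolvable: {exc}") from exc
```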
If parsing url resulted in a <path> component, then replace the matching substring of url with the string that results from applying the following steps to each character other than "%" (U+0025) that doesn't match the original <path> production defined in RFC 3986:
For instance, if url was "//example.com/a^b☺c%FFd%z/?e", then the <path> component's substring would be "/a^b☺c%FFd%z/" and the two characters that would have to be escaped would be "^" and "☺". The result after this step was applied would therefore be that url now had the value "//example.com/a%5Eb%E2%98%BAc%FFd%z/?e".
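The <path> escaping step can be sketched in Python as follows. This is a hedged illustration: it assumes each offending character is encoded as UTF-8 and every resulting octet is percent-encoded, and it approximates "matches the <path> production" with the RFC 3986 pchar character set plus "/". The function name `escape_path` is hypothetical.

```python
# Characters allowed by the RFC 3986 path productions (pchar plus "/").
UNRESERVED = set(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~"
)
SUB_DELIMS = set("!$&'()*+,;=")
PATH_ALLOWED = UNRESERVED | SUB_DELIMS | set(":@/")


def escape_path(path: str) -> str:
    out = []
    for ch in path:
        if ch == "%" or ch in PATH_ALLOWED:
            # "%" is always left alone, even when it does not start a
            # well-formed percent-encoded octet.
            out.append(ch)
        else:
            # Assumption: encode as UTF-8, then percent-encode every octet.
            out.extend(f"%{byte:02X}" for byte in ch.encode("utf-8"))
    return "".join(out)
```

Applied to the path substring "/a^b☺c%FFd%z/", this sketch escapes only "^" and "☺" and leaves "%FF" and "%z" untouched.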
If parsing url resulted in a <query> component, then replace the matching substring of url with the string that results from applying the following steps to each character other than "%" (U+0025) that doesn't match the original <query> production defined in RFC 3986:
Apply the algorithm described in RFC 3986 section 5.2 Relative Resolution, using url as the potentially relative URI reference (R), and base as the base URI (Base). [RFC3986]
For instance, if an absolute URI that would be returned by the above algorithm violates the restrictions specific to its scheme, e.g. a data: URI using the "//" server-based naming authority syntax, then user agents are to treat this as an error instead.
Let result be the target URI (T) returned by the Relative Resolution algorithm.
If result uses a scheme with a server-based naming authority, replace all "\" (U+005C) characters in result with "/" (U+002F) characters.
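The final resolution steps can be approximated in Python. This is a rough sketch only: `urllib.parse.urljoin` is used as a stand-in for the RFC 3986 section 5.2 algorithm and is not guaranteed to match it in every corner case, and the set of schemes treated as having a server-based naming authority is an assumption made for the example.

```python
from urllib.parse import urljoin, urlsplit

# Assumption: schemes treated as using a server-based naming authority.
SERVER_BASED_SCHEMES = {"http", "https", "ftp"}


def resolve(url: str, base: str) -> str:
    # Stand-in for RFC 3986 section 5.2 Relative Resolution.
    result = urljoin(base, url)
    # Replace "\" (U+005C) with "/" (U+002F) only for server-based schemes.
    if urlsplit(result).scheme in SERVER_BASED_SCHEMES:
        result = result.replace("\\", "/")
    return result
```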
Some of the steps in these rules, for example the processing of "\" (U+005C) characters, are a willful violation of RFC 3986 and RFC 3987, motivated by a desire to handle legacy content. [RFC3986] [RFC3987]
A URL is an absolute URL if resolving it results in the same output regardless of what it is resolved relative to, and that output is not a failure.
An absolute URL is a hierarchical URL if, when resolved and then parsed, there is a character immediately after the <scheme> component and it is a "/" (U+002F) character.
An absolute URL is an authority-based URL if, when resolved and then parsed, there are two characters immediately after the <scheme> component and they are both "/" (U+002F) characters.
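The two classifications above can be sketched in Python. This sketch assumes the input has already been resolved to an absolute URL, and it interprets "immediately after the <scheme> component" as "immediately after the scheme and its ":" delimiter"; both the interpretation and the function names are assumptions.

```python
import re

# RFC 3986 scheme syntax followed by the ":" delimiter.
SCHEME_PREFIX = re.compile(r"^[A-Za-z][A-Za-z0-9+.\-]*:")


def _after_scheme(absolute_url: str):
    match = SCHEME_PREFIX.match(absolute_url)
    return None if match is None else absolute_url[match.end():]


def is_hierarchical(absolute_url: str) -> bool:
    # One "/" directly after "scheme:".
    rest = _after_scheme(absolute_url)
    return rest is not None and rest.startswith("/")


def is_authority_based(absolute_url: str) -> bool:
    # Two "/" characters directly after "scheme:".
    rest = _after_scheme(absolute_url)
    return rest is not None and rest.startswith("//")
```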
To fragment-escape a string input, a user agent must run the following steps:
Let input be the string to be escaped.
Let position point at the first character of input.
Let output be an empty string.
Loop: If position is past the end of input, then jump to the step labeled end.
If the character in input pointed to by position is in the range U+0000 to U+0020 or is one of the following characters:
...then append the percent-encoded form of the character to output. [RFC3986]
Otherwise, append the character itself to output.
This escapes any ASCII characters that are not valid in the URI <fragment> production without being escaped.
Advance position to the next character in input.
Return to the step labeled loop.
End: Return output.
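The fragment-escape loop can be sketched in Python. Because the list of additional characters is not reproduced above, the sketch takes it as a hypothetical `extra_characters` parameter instead of guessing its contents; the percent-encoded form is assumed to be the UTF-8 octets of the character written as "%XX".

```python
def fragment_escape(input_string: str, extra_characters: frozenset = frozenset()) -> str:
    output = []
    # Walking the string with a for loop plays the role of "position".
    for ch in input_string:
        if "\u0000" <= ch <= "\u0020" or ch in extra_characters:
            # Append the percent-encoded form of the character.
            output.extend(f"%{byte:02X}" for byte in ch.encode("utf-8"))
        else:
            # Append the character itself.
            output.append(ch)
    return "".join(output)
```

Characters outside the escaped set, including non-ASCII characters, pass through unchanged in this sketch.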
If the absolute URL identified by the hyperlink is being shown to the user, or if any data derived from that URL is affecting the display, then the href attribute should be re-resolved relative to the element and the UI updated appropriately.
del element with a
If the absolute URL identified by the cite attribute is being shown to the user, or if any data derived from that URL is affecting the display, then the URL should be re-resolved relative to the element and the UI updated appropriately.
The element is not directly affected.
For instance, changing the base URL doesn't affect the image displayed by img elements, although subsequent accesses of the src IDL attribute from script will return a new absolute URL that might no longer correspond to the image being shown.
An interface that has a complement of URL decomposition IDL attributes has seven attributes with the following definitions:
attribute DOMString protocol;
attribute DOMString host;
attribute DOMString hostname;
attribute DOMString port;
attribute DOMString pathname;
attribute DOMString search;
attribute DOMString hash;
protocol[ = value ]
Returns the current scheme of the underlying URL.
Can be set, to change the underlying URL's scheme.
host[ = value ]
Returns the current host and port (if it's not the default port) in the underlying URL.
Can be set, to change the underlying URL's host and port.
The host and the port are separated by a colon. The port part, if omitted, will be assumed to be the current scheme's default port.
hostname[ = value ]
Returns the current host in the underlying URL.
Can be set, to change the underlying URL's host.
port[ = value ]
Returns the current port in the underlying URL.
Can be set, to change the underlying URL's port.
pathname[ = value ]
Returns the current path in the underlying URL.
Can be set, to change the underlying URL's path.
search[ = value ]
Returns the current query component in the underlying URL.
Can be set, to change the underlying URL's query component.
hash[ = value ]
Returns the current fragment identifier in the underlying URL.
Can be set, to change the underlying URL's fragment identifier.
The attributes defined to be URL decomposition IDL attributes must act as described for the attributes with the same corresponding names in this section.
In addition, an interface with a complement of URL decomposition IDL attributes defines an input, which is a URL that the attributes act on, and a common setter action, which is a set of steps invoked when any of the attributes' setters are invoked.
The seven URL decomposition IDL attributes have similar requirements.
On getting, if the input is an absolute URL that fulfills the condition given in the "getter condition" column corresponding to the attribute in the table below, the user agent must return the part of the input URL given in the "component" column, with any prefixes specified in the "prefix" column appropriately added to the start of the string and any suffixes specified in the "suffix" column appropriately added to the end of the string. Otherwise, the attribute must return the empty string.
On setting, the new value must first be mutated as described by the "setter preprocessor" column, then mutated by %-escaping any characters in the new value that are not valid in the relevant component as given by the "component" column. Then, if the input is an absolute URL and the resulting new value fulfills the condition given in the "setter condition" column, the user agent must make a new string output by replacing the component of the URL given by the "component" column in the input URL with the new value; otherwise, the user agent must let output be equal to the input. Finally, the user agent must invoke the common setter action with the value of output.
When replacing a component in the URL, if the component is part of an optional group in the URL syntax consisting of a character followed by the component, the component (including its prefix character) must be included even if the new value is the empty string.
The previous paragraph applies in particular to the ":" before a <port> component, the "?" before a <query> component, and the "#" before a <fragment> component.
For the purposes of the above definitions, URLs must be parsed using the URL parsing rules defined in this specification.
|Attribute||Component||Getter Condition||Prefix||Suffix||Setter Preprocessor||Setter Condition|
|protocol||<scheme>||—||—||":" (U+003A)||Remove all trailing ":" (U+003A) characters||The new value is not the empty string|
|host||<hostport>||input is an authority-based URL||—||—||—||The new value is not the empty string and input is an authority-based URL|
|hostname||<host>||input is an authority-based URL||—||—||Remove all leading "/" (U+002F) characters||The new value is not the empty string and input is an authority-based URL|
|port||<port>||input is an authority-based URL, and contained a <port> component (possibly an empty one)||—||—||Remove all characters in the new value from the first that is not in the range ASCII digits, if any. Remove any leading "0" (U+0030) characters in the new value. If the resulting string is empty, set it to a single "0" (U+0030) character.||input is an authority-based URL, and the new value, when interpreted as a base-ten integer, is less than or equal to 65535|
|pathname||<path>||input is a hierarchical URL||—||—||If it has no leading "/" (U+002F) character, prepend a "/" (U+002F) character to the new value||input is hierarchical|
|search||<query>||input is a hierarchical URL, and contained a <query> component (possibly an empty one)||"?" (U+003F)||—||Remove one leading "?" (U+003F) character, if any||input is a hierarchical URL|
|hash||<fragment>||input contained a non-empty <fragment> component||"#" (U+0023)||—||Remove one leading "#" (U+0023) character, if any||—|
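As an illustration of one row of the table, the <port> setter preprocessor and setter condition can be sketched in Python. The function names are hypothetical, and whether the input is an authority-based URL is passed in as a flag rather than computed.

```python
import re


def preprocess_port(new_value: str) -> str:
    # Keep only the leading run of ASCII digits (drop everything from the
    # first non-digit character onwards).
    digits = re.match(r"[0-9]*", new_value).group(0)
    # Remove leading "0" characters; an empty result becomes a single "0".
    return digits.lstrip("0") or "0"


def port_setter_condition(value: str, input_is_authority_based: bool) -> bool:
    # The preprocessed value must fit in the 0..65535 range.
    return input_is_authority_based and int(value) <= 65535
```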
The table below demonstrates how the getter condition for search results in different results depending on the exact original syntax of the URL:
|Input URL|| ||Explanation|
| ||empty string||No <query> component in input URL.|
| || ||There is a <query> component, but it is empty. The question mark in the resulting value is the prefix.|
| || || The <query> component has the value "|
| || ||The (empty) <fragment> component is not part of the <query> component.|
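The distinction between a missing and an empty <query> component can be sketched in Python. This is a simplified illustration that assumes the input is already a hierarchical absolute URL; the function name and the example.com URLs used with it are hypothetical and are not taken from the table above.

```python
def search_getter(url: str) -> str:
    # The <fragment> component is never part of the <query> component.
    without_fragment = url.split("#", 1)[0]
    _, question_mark, query = without_fragment.partition("?")
    if not question_mark:
        # No <query> component at all: return the empty string.
        return ""
    # A <query> component exists (possibly empty): add the "?" prefix.
    return "?" + query
```

For example, `search_getter("http://example.com/?")` yields "?" because the component is present but empty, while `search_getter("http://example.com/")` yields the empty string.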
The following table is similar; it provides a list of what each of the URL decomposition IDL attributes returns for a given input URL.
|(empty string)||(empty string)|
When a user agent is to fetch a resource or URL, optionally from an origin origin, and optionally with a synchronous flag, a manual redirect flag, a force same-origin flag, and/or a block cookies flag, the following steps must be run. (When a URL is to be fetched, the URL identifies a resource to be obtained.)
Let document be the appropriate Document as given by the following list:
Remove any <fragment> component from the generated address of the resource from which Request-URIs are obtained.
If the algorithm was not invoked with the synchronous flag, perform the remaining steps asynchronously.
This is the main step.
If the resource is identified by an absolute URL, and the resource is to be obtained using an idempotent action (such as an HTTP GET or equivalent), and it is already being downloaded for other reasons (e.g. another invocation of this algorithm), and this request would be identical to the previous one (e.g. same Origin headers), and the user agent is configured such that it is to reuse the data from the existing download instead of initiating a new one, then use the results of the existing download instead of starting a new one.
Otherwise, if the resource is identified by an absolute URL with a scheme that does not define a mechanism to obtain the resource (e.g. it is a mailto: URL) or that the user agent does not support, then act as if the resource was an HTTP 204 No Content response with no other metadata.
Otherwise, if the resource is identified by the URL about:blank, then the resource is immediately available and consists of the empty string, with no metadata.
Otherwise, at a time convenient to the user and the user agent, download (or otherwise obtain) the resource, applying the semantics of the relevant specifications (e.g. performing an HTTP GET or POST operation, or reading the file from disk, dereferencing javascript: URLs, etc).
For the purposes of the Referer (sic) header, use the address of the resource from which Request-URIs are obtained generated in the earlier step.
For the purposes of the Origin header, if the fetching algorithm was explicitly initiated from an origin, then the origin that initiated the HTTP request is origin. Otherwise, this is a request from a "privacy-sensitive" context. [ORIGIN]
If the algorithm was not invoked with the block cookies flag, and there are cookies to be set, then the user agent must run the following substeps:
If the fetched resource is an HTTP redirect or equivalent, then:
Abort these steps and return failure from this algorithm, as if the remote host could not be contacted.
Continue, using the fetched resource (the redirect) as the result of the algorithm. If the calling algorithm subsequently requires the user agent to transparently follow the redirect, then the user agent must resume this algorithm from the main step, but using the target of the redirect as the resource to fetch, rather than the original resource.
First, apply any relevant requirements for redirects (such as showing any appropriate prompts). Then, redo the main step, but using the target of the redirect as the resource to fetch, rather than the original resource.
The HTTP specification requires that 301, 302, and 307 redirects, when applied to methods other than the safe methods, not be followed without user confirmation. That would be an appropriate prompt for the purposes of the requirement in the paragraph above. [HTTP]
If the algorithm was not invoked with the synchronous flag: When the resource is available, or if there is an error of some description, queue a task that uses the resource as appropriate. If the resource can be processed incrementally, as, for instance, with a progressively interlaced JPEG or an HTML file, additional tasks may be queued to process the data as it is downloaded. The task source for these tasks is the networking task source.
Otherwise, return the resource or error information to the calling algorithm.
If the user agent can determine the actual length of the resource being fetched for an instance of this algorithm, and if that length is finite, then that length is the file's size. Otherwise, the subject of the algorithm (that is, the resource being fetched) has no known size. (For example, the HTTP Content-Length header might provide this information.)
The user agent must also keep track of the number of bytes downloaded for each instance of this algorithm. This number must exclude any out-of-band metadata, such as HTTP headers.
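The size and progress bookkeeping described above might look like the following Python sketch. The class and method names are hypothetical, and treating a well-formed Content-Length value as the file's size is only one possible way for a user agent to determine the length.

```python
from typing import Optional


class FetchProgress:
    """Tracks the size and downloaded byte count of one fetch instance."""

    def __init__(self, content_length: Optional[str] = None) -> None:
        # None means the resource has no known size.
        self.size: Optional[int] = None
        if content_length is not None:
            value = content_length.strip()
            if value.isascii() and value.isdigit():
                self.size = int(value)
        self.bytes_downloaded = 0

    def on_body_chunk(self, chunk: bytes) -> None:
        # Only body bytes are counted; out-of-band metadata such as HTTP
        # headers is never passed to this method.
        self.bytes_downloaded += len(chunk)
```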
The navigation processing model handles redirects itself, overriding the redirection handling that would be done by the fetching algorithm.
Whether the type sniffing rules apply to the fetched resource depends on the algorithm that invokes the rules — they are not always applicable.
User agents can implement a variety of transfer protocols, but this specification mostly defines behavior in terms of HTTP. [HTTP]
The HTTP GET method is equivalent to the default retrieval action of the protocol. For example, RETR in FTP. Such actions are idempotent and safe, in HTTP terms.
The HTTP response codes are equivalent to statuses in other protocols that have the same basic meanings. For example, a "file not found" error is equivalent to a 404 code, a server error is equivalent to a 5xx code, and so on.
The HTTP headers are equivalent to fields in other protocols that have the same basic meaning. For example, the HTTP authentication headers are equivalent to the authentication aspects of the FTP protocol.
Anything in this specification that refers to HTTP also applies to HTTP-over-TLS, as represented by URLs representing the https scheme.
User agents should report certificate errors to the user and must either refuse to download resources sent with erroneous certificates or must act as if such resources were in fact served with no encryption.
User agents should warn the user that there is a potential problem whenever the user visits a page that the user has previously visited, if the page uses less secure encryption on the second visit.
Not doing so can result in users not noticing man-in-the-middle attacks.
If a user connects to a server with a self-signed certificate, the user agent could allow the connection but just act as if there had been no encryption. If the user agent instead allowed the user to override the problem and then displayed the page as if it was fully and safely encrypted, the user could be easily tricked into accepting man-in-the-middle connections.
If a user connects to a server with full encryption, but the page then refers to an external resource that has an expired certificate, then the user agent will act as if the resource was unavailable, possibly also reporting the problem to the user. If the user agent instead allowed the resource to be used, then an attacker could just look for "secure" sites that used resources from a different host and only apply man-in-the-middle attacks to that host, for example taking over scripts in the page.
If a user bookmarks a site that uses a CA-signed certificate, and then later revisits that site directly but the site has started using a self-signed certificate, the user agent could warn the user that a man-in-the-middle attack is likely underway, instead of simply acting as if the page was not encrypted.
The Content-Type metadata of a resource must be obtained and interpreted in a manner consistent with the requirements of the Media Type Sniffing specification. [MIMESNIFF]
The sniffed type of a resource must be found in a manner consistent with the requirements given in the Media Type Sniffing specification for finding the sniffed-type of the relevant sequence of octets. [MIMESNIFF]
The rules for sniffing images specifically and the rules for distinguishing if a resource is text or binary are also defined in the Media Type Sniffing specification. Both sets of rules return a MIME type as their result. [MIMESNIFF]
It is imperative that the rules in the Media Type Sniffing specification be followed exactly. When a user agent uses different heuristics for content type detection than the server expects, security problems can occur. For more details, see the Media Type Sniffing specification. [MIMESNIFF]
The algorithm for extracting an encoding from a meta element, given a string s, is as follows. It either returns an encoding or nothing.
Let position be a pointer into s, initially pointing at the start of the string.
Loop: Find the first seven characters in s after position that are an ASCII case-insensitive match for the word "charset". If no such match is found, return nothing and abort these steps.
Skip any U+0009, U+000A, U+000C, U+000D, or U+0020 characters that immediately follow the word "charset" (there might not be any).
If the next character is not a "=" (U+003D), then move position to point just before that next character, and jump back to the step labeled loop.
Skip any U+0009, U+000A, U+000C, U+000D, or U+0020 characters that immediately follow the equals sign (there might not be any).
Process the next character as follows:
This algorithm is distinct from those in the HTTP specification (for example, HTTP doesn't allow the use of single quotes and requires supporting a backslash-escape mechanism that is not supported by this algorithm). While the algorithm is used in contexts that, historically, were related to HTTP, the syntax as supported by implementations diverged some time ago. [HTTP]
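The extraction loop can be sketched in Python. This is a hedged sketch: the handling of the character after the equals sign is not reproduced above, so the quoted and unquoted cases below (matching quotes delimit the value; an unquoted value runs up to the next whitespace character or ";") are assumptions. The sketch also returns the raw label string, where the algorithm proper returns an encoding.

```python
import re
from typing import Optional

WHITESPACE = "\t\n\x0c\r "
# re.ASCII limits case-insensitive matching to ASCII letters.
CHARSET = re.compile(r"charset", re.IGNORECASE | re.ASCII)


def extract_charset_label(s: str) -> Optional[str]:
    position = 0
    while True:
        match = CHARSET.search(s, position)
        if match is None:
            return None  # No "charset" found: return nothing.
        position = match.end()
        while position < len(s) and s[position] in WHITESPACE:
            position += 1
        if position < len(s) and s[position] == "=":
            break
        # Not "=": stay just before this character and search again.
    position += 1  # Skip the equals sign.
    while position < len(s) and s[position] in WHITESPACE:
        position += 1
    if position >= len(s):
        return None
    if s[position] in "\"'":
        # Assumption: a quoted value ends at the next matching quote;
        # an unmatched quote yields nothing. No backslash escapes.
        end = s.find(s[position], position + 1)
        return None if end == -1 else s[position + 1:end]
    # Assumption: an unquoted value ends at whitespace, ";" or end of string.
    end = position
    while end < len(s) and s[end] not in WHITESPACE and s[end] != ";":
        end += 1
    return s[position:end]
```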
A CORS settings attribute is an enumerated attribute. The following table lists the keywords and states for the attribute — the keywords in the left column map to the states in the cell in the second column on the same row as the keyword.
|Keyword||State||Brief description|
|anonymous||Anonymous||Cross-origin CORS requests for the element will not have the credentials flag set.|
|use-credentials||Use Credentials||Cross-origin CORS requests for the element will have the credentials flag set.|
The empty string is also a valid keyword, and maps to the Anonymous state. The attribute's invalid value default is the Anonymous state. The missing value default, used when the attribute is omitted, is the No CORS state.
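The keyword-to-state mapping, including the missing and invalid value defaults, can be sketched in Python. The function name is hypothetical, the keyword spellings `anonymous` and `use-credentials` are assumed, and a missing attribute is represented by `None`.

```python
from typing import Optional


def cors_settings_state(attribute_value: Optional[str]) -> str:
    # Missing value default: the attribute is omitted entirely.
    if attribute_value is None:
        return "No CORS"
    # Enumerated attribute keywords are compared case-insensitively.
    keyword = attribute_value.lower()
    if keyword == "use-credentials":
        return "Use Credentials"
    # "anonymous", the empty string, and any invalid value all map to
    # the Anonymous state.
    return "Anonymous"
```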
When the user agent is required to perform a potentially CORS-enabled fetch of an absolute URL URL, with a mode mode that is either "No CORS", "Anonymous", or "Use Credentials", an origin origin, and a default origin behaviour default which is either "taint" or "fail", it must run the first applicable set of steps from the following list. The default origin behaviour is only used if mode is "No CORS". This algorithm wraps the fetch algorithm above, and labels the obtained resource as either CORS-same-origin or CORS-cross-origin, or blocks the resource entirely.
Run these substeps:
Let result have no value.
Fetch URL, with the manual redirect flag set.
Loop: Wait for the fetch algorithm to know if the result is a redirect or not.
If the result of the fetch is a redirect, and the mode is not "No CORS", and the origin of the target URL of the redirect is not the same origin as origin, then set URL to the target URL of the redirect and return to the top of the potentially CORS-enabled fetch algorithm (this time, the branch below will be taken, resulting in the fetch being done in a CORS-aware fashion).
Otherwise, if the result of the fetch is a redirect, and result still has no value, then apply the CORS redirect steps, with the CORS credential flag set to true and the request rules being that the user agent continue to follow these steps. If this resumes the fetch algorithm, then return to the loop step. If it failed due to a failure of the CORS resource sharing check, then: if default is fail, then set result to fail and jump to the step labeled end; if default is taint, then set result to taint, transparently follow the redirect but with the manual redirect flag no longer set, and jump to the step labeled end below.
Otherwise, if the resource is not available (e.g. there is a network error) then set result to the same value as default, and jump to the step labeled end below.
Otherwise, perform a resource sharing check, with the CORS credential flag set to true. If it returns fail, then set result to the same value as default; otherwise, set result to success. Then, jump to the step labeled end below.
End: Jump to the appropriate step from the following list:
Discard all fetched data and prevent any tasks from the fetch algorithm from being queued. For the purposes of the calling algorithm, the user agent must act as if there was a fatal network error and no resource was obtained. The user agent may report a cross-origin resource access failure to the user (e.g. in a debugging console).
The tasks from the fetch algorithm are queued normally, but for the purposes of the calling algorithm, the obtained resource is CORS-cross-origin. The user agent may report a cross-origin resource access failure to the user (e.g. in a debugging console).
Run these steps:
Wait for the CORS cross-origin request status to have a value.
Jump to the appropriate step from the following list:
Discard all fetched data and prevent any tasks from the fetch algorithm from being queued. For the purposes of the calling algorithm, the user agent must act as if there was a fatal network error and no resource was obtained. If a CORS resource sharing check failed, the user agent may report a cross-origin resource access failure to the user (e.g. in a debugging console).