Table of contents
      1. 8.2.6 The end
      2. 8.2.7 Coercing an HTML DOM into an infoset
      3. 8.2.8 An introduction to error handling and strange cases in the parser
        1. 8.2.8.1 Misnested tags: <b><i></b></i>
        2. 8.2.8.2 Misnested tags: <b><p></b></p>
        3. 8.2.8.3 Unexpected markup in tables
        4. 8.2.8.4 Scripts that modify the page as it is being parsed
        5. 8.2.8.5 The execution of scripts that are moving across multiple documents
        6. 8.2.8.6 Unclosed formatting elements
    1. 8.3 Serializing HTML fragments
    2. 8.4 Parsing HTML fragments

8.2.6 The end

Once the user agent stops parsing the document, the user agent must run the following steps:

  1. Set the current document readiness to "interactive" and the insertion point to undefined.

  2. Pop all the nodes off the stack of open elements.

  3. If the list of scripts that will execute when the document has finished parsing is not empty, run these substeps:

    1. Spin the event loop until the first script in the list of scripts that will execute when the document has finished parsing has its "ready to be parser-executed" flag set and the parser's Document has no style sheet that is blocking scripts.

    2. Execute the first script in the list of scripts that will execute when the document has finished parsing.

    3. Remove the first script element from the list of scripts that will execute when the document has finished parsing (i.e. shift out the first entry in the list).

    4. If the list of scripts that will execute when the document has finished parsing is still not empty, repeat these substeps again from substep 1.

  4. Queue a task to fire a simple event that bubbles named DOMContentLoaded at the Document.

  5. Spin the event loop until the set of scripts that will execute as soon as possible and the list of scripts that will execute in order as soon as possible are empty.

  6. Spin the event loop until there is nothing that delays the load event in the Document.

  7. Queue a task to set the current document readiness to "complete".

  8. If the Document is in a browsing context, then queue a task to fire a simple event named load at the Document's Window object, but with its target set to the Document object (and the currentTarget set to the Window object).

  9. If the Document is in a browsing context, then queue a task to fire a pageshow event at the Window object of the Document, but with its target set to the Document object (and the currentTarget set to the Window object), using the PageTransitionEvent interface, with the persisted attribute set to false. This event must not bubble, must not be cancelable, and has no default action.

  10. If the Document has any pending application cache download process tasks, then queue each such task in the order they were added to the list of pending application cache download process tasks, and then empty the list of pending application cache download process tasks. The task source for these tasks is the networking task source.

  11. The Document is now ready for post-load tasks.

  12. Queue a task to mark the Document as completely loaded.
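
The ordering that the steps above impose can be sketched as a small simulation. This sketch is non-normative and makes several assumptions: the name `finish_parsing`, its parameters, and the log strings are invented for illustration, the event loop is collapsed into a plain list, every script is assumed to be ready when its turn comes, and steps 5, 6, 10, 11, and 12 are left out.

```python
from collections import deque


def finish_parsing(deferred_scripts, in_browsing_context=True):
    """Return the order of observable milestones once parsing stops.

    Hypothetical model: each script is just a name, and every script is
    assumed to be "ready to be parser-executed" when it reaches the front.
    """
    log = ['readiness = "interactive"']            # step 1
    pending = deque(deferred_scripts)              # scripts to run after parsing
    while pending:                                 # step 3
        log.append(f"execute {pending[0]}")        # step 3.2
        pending.popleft()                          # step 3.3: shift out first entry
    log.append("DOMContentLoaded")                 # step 4
    log.append('readiness = "complete"')           # step 7
    if in_browsing_context:
        log.append("load")                         # step 8
        log.append("pageshow (persisted=false)")   # step 9
    return log
```

Calling `finish_parsing(["a.js", "b.js"])` under these assumptions lists both script executions before DOMContentLoaded, and load before pageshow.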

When the user agent is to abort a parser, it must run the following steps:

  1. Throw away any pending content in the input stream, and discard any future content that would have been added to it.

  2. Pop all the nodes off the stack of open elements.

Except where otherwise specified, the task source for the tasks mentioned in this section is the DOM manipulation task source.

8.2.7 Coercing an HTML DOM into an infoset

When an application uses an HTML parser in conjunction with an XML pipeline, it is possible that the constructed DOM is not compatible with the XML toolchain in certain subtle ways. For example, an XML toolchain might not be able to represent attributes with the name xmlns, since they conflict with the Namespaces in XML syntax. There is also some data that the HTML parser generates that isn't included in the DOM itself. This section specifies some rules for handling these issues.

If the XML API being used doesn't support DOCTYPEs, the tool may drop DOCTYPEs altogether.

If the XML API doesn't support attributes in no namespace that are named "xmlns", attributes whose names start with "xmlns:", or attributes in the XMLNS namespace, then the tool may drop such attributes.

The tool may annotate the output with any namespace declarations required for proper operation.

If the XML API being used restricts the allowable characters in the local names of elements and attributes, then the tool may map all element and attribute local names that the API wouldn't support to a set of names that are allowed, by replacing any character that isn't supported with the uppercase letter U and the six digits of the character's Unicode code point when expressed in hexadecimal, using digits 0-9 and capital letters A-F as the symbols, in increasing numeric order.

For example, the element name foo<bar, which can be output by the HTML parser, though it is neither a legal HTML element name nor a well-formed XML element name, would be converted into fooU00003Cbar, which is a well-formed XML element name (though it's still not legal in HTML by any means).

As another example, consider the attribute xlink:href. Used on a MathML element, it becomes, after being adjusted, an attribute with a prefix "xlink" and a local name "href". However, used on an HTML element, it becomes an attribute with no prefix and the local name "xlink:href", which is not a valid NCName, and thus might not be accepted by an XML API. It could thus get converted, becoming "xlinkU00003Ahref".

The resulting names from this conversion conveniently can't clash with any attribute generated by the HTML parser, since those are all either lowercase or those listed in the adjust foreign attributes algorithm's table.
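
The name mapping described above can be sketched as follows. This sketch is non-normative. The function name `coerce_local_name` is hypothetical, and the set of characters the XML API accepts is an assumption: it is narrowed here to ASCII letters, digits, underscore, hyphen, and full stop, whereas a real API would typically accept the full NCName character ranges.

```python
import re

# Assumption: the XML API accepts only ASCII NCNames. Real NCName rules
# admit many non-ASCII characters, so this check is deliberately narrow.
_ALLOWED_START = re.compile(r"[A-Za-z_]")
_ALLOWED_REST = re.compile(r"[A-Za-z0-9_.\-]")


def coerce_local_name(name):
    """Replace each unsupported character with 'U' plus six hex digits."""
    out = []
    for index, char in enumerate(name):
        allowed = _ALLOWED_START if index == 0 else _ALLOWED_REST
        if allowed.fullmatch(char):
            out.append(char)
        else:
            # Six uppercase hexadecimal digits, zero-padded.
            out.append("U%06X" % ord(char))
    return "".join(out)


print(coerce_local_name("foo<bar"))     # fooU00003Cbar
print(coerce_local_name("xlink:href"))  # xlinkU00003Ahref
```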

If the XML API restricts comments from having two consecutive U+002D HYPHEN-MINUS characters (--), the tool may insert a single U+0020 SPACE character between any such offending characters.

If the XML API restricts comments from ending in a U+002D HYPHEN-MINUS character (-), the tool may insert a single U+0020 SPACE character at the end of such comments.

If the XML API restricts allowed characters in character data, attribute values, or comments, the tool may replace any U+000C FORM FEED (FF) character with a U+0020 SPACE character, and any other literal non-XML character with a U+FFFD REPLACEMENT CHARACTER.
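
The three rules above for comments and character data might be applied as in the following non-normative sketch. The function names are hypothetical, and the sketch assumes that the XML API enforces all three restrictions and that "non-XML character" means anything outside the XML 1.0 Char production.

```python
import re


def is_xml_char(char):
    """True if char matches the XML 1.0 Char production."""
    code = ord(char)
    return (
        code in (0x9, 0xA, 0xD)
        or 0x20 <= code <= 0xD7FF
        or 0xE000 <= code <= 0xFFFD
        or 0x10000 <= code <= 0x10FFFF
    )


def coerce_characters(text):
    """Map FORM FEED to SPACE and other non-XML characters to U+FFFD."""
    return "".join(
        " " if char == "\f" else char if is_xml_char(char) else "\uFFFD"
        for char in text
    )


def coerce_comment(data):
    """Apply the character, double-hyphen, and trailing-hyphen rules."""
    data = coerce_characters(data)
    # Every hyphen that is directly followed by another hyphen gets a
    # space inserted after it, so no two hyphens remain adjacent.
    data = re.sub(r"-(?=-)", "- ", data)
    if data.endswith("-"):
        data += " "
    return data


print(repr(coerce_comment("a--b")))  # 'a- -b'
```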

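If the tool has no way to convey out-of-band information, then the tool may drop the following information:

  * Whether the document is set to no-quirks mode, limited-quirks mode, or quirks mode

  * The association between form controls and forms that aren't their nearest form element ancestor (use of the form element pointer in the parser)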

The mutations allowed by this section apply after the HTML parser's rules have been applied. For example, a <a::> start tag will be closed by a </a::> end tag, and never by a </aU00003AU00003A> end tag, even if the user agent is using the rules above to then generate an actual element in the DOM with the name aU00003AU00003A for that start tag.

8.2.8 An introduction to error handling and strange cases in the parser

This section is non-normative.

This section examines some erroneous markup and discusses how the HTML parser handles these cases.

8.2.8.1 Misnested tags: <b><i></b></i>

This section is non-normative.

The most-often discussed example of erroneous markup is as follows:

<p>1<b>2<i>3</b>4</i>5</p>

The parsing of this markup is straightforward up to the "3". At this point, the DOM looks like this:

html
  head
  body
    p
      #text: 1
      b
        #text: 2
        i
          #text: 3

Here, the stack of open elements has five elements on it: html, body, p, b, and i. The list of active formatting elements just has two: b and i. The insertion mode is "in body".

Upon receiving the end tag token with the tag name "b", the "adoption agency algorithm" is invoked. This is a simple case, in that the formatting element is the b element, and there is no furthest block. Thus, the stack of open elements ends up with just three elements: html, body, and p, while the list of active formatting elements has just one: i. The DOM tree is unmodified at this point.

The next token is a character ("4"), which triggers the reconstruction of the active formatting elements, in this case just the i element. A new i element is thus created for the "4" text node. After the end tag token for the "i" is also received, and the "5" text node is inserted, the DOM looks as follows:

html
  head
  body
    p
      #text: 1
      b
        #text: 2
        i
          #text: 3
      i
        #text: 4
      #text: 5
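
The simple case walked through above (a formatting end tag with no furthest block, followed by reconstruction of the active formatting elements) can be imitated with a much-reduced model. This sketch is non-normative and is not the adoption agency algorithm: the names `Node`, `parse`, and `FORMATTING` are hypothetical, markers, scopes, and the furthest-block case are not modelled, and the input is a hand-written token list rather than real tokenizer output.

```python
FORMATTING = {"b", "i"}  # assumption: only these formatting tags are modelled


class Node:
    def __init__(self, tag):
        self.tag = tag
        self.children = []  # Node instances or plain strings (text)

    def markup(self):
        inner = "".join(
            c if isinstance(c, str) else c.markup() for c in self.children
        )
        return f"<{self.tag}>{inner}</{self.tag}>"


def parse(tokens):
    """Build a tree from (kind, value) tokens using simplified rules."""
    root = Node("body")
    stack = [root]  # stack of open elements
    active = []     # list of active formatting elements

    def open_element(tag):
        node = Node(tag)
        stack[-1].children.append(node)
        stack.append(node)
        return node

    def reconstruct():
        # Re-open every active formatting element that is no longer open.
        for index, entry in enumerate(active):
            if entry not in stack:
                active[index] = open_element(entry.tag)

    for kind, value in tokens:
        if kind == "text":
            reconstruct()
            stack[-1].children.append(value)
        elif kind == "start":
            if value in FORMATTING:
                reconstruct()
                active.append(open_element(value))
            else:
                open_element(value)
        elif kind == "end":
            # Pop up to and including the nearest open element with this tag.
            while stack.pop().tag != value:
                pass
            if value in FORMATTING:
                # Drop the most recent active entry for this tag.
                for index in range(len(active) - 1, -1, -1):
                    if active[index].tag == value:
                        del active[index]
                        break
    return root


tokens = [
    ("start", "p"), ("text", "1"), ("start", "b"), ("text", "2"),
    ("start", "i"), ("text", "3"), ("end", "b"), ("text", "4"),
    ("end", "i"), ("text", "5"), ("end", "p"),
]
print(parse(tokens).children[0].markup())
# <p>1<b>2<i>3</i></b><i>4</i>5</p>
```

The second i element in the output is the one created by reconstruction when the "4" character token arrives.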

8.2.8.2 Misnested tags: <b><p></b></p>

This section is non-normative.

A case similar to the previous one is the following:

<b>1<p>2</b>3</p>

Up to the "2" the parsing here is straightforward:

html
  head
  body
    b
      #text: 1
      p
        #text: 2

The interesting part is when the end tag token with the tag name "b" is parsed.

Before that token is seen, the stack of open elements has four elements on it: html, body, b, and p. The list of active formatting elements just has the one: b. The insertion mode is "in body".

Upon receiving the end tag token with the tag name "b", the "adoption agency algorithm" is invoked, as in the previous example. However, in this case, there is a furthest block, namely the p element. Thus, this time the adoption agency algorithm isn't skipped over.

The common ancestor is the body element. A conceptual "bookmark" marks the position of the b in the list of active formatting elements, but since that list has only one element in it, the bookmark won't have much effect.

As the algorithm progresses, node ends up set to the formatting element (b), and last node ends up set to the furthest block (p).

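The last node gets appended (moved) to the common ancestor, so that the DOM looks like:

html
  head
  body
    b
      #text: 1
    p
      #text: 2

A new b element is created, and the children of the p element are moved to it (the new b element is not yet attached to the tree):

html
  head
  body
    b
      #text: 1
    p

b
  #text: 2

Finally, the new b element is appended to the p element, so that the DOM looks like:

html
  head
  body
    b
      #text: 1
    p
      b
        #text: 2

The b element is removed from the list of active formatting elements and the stack of open elements, so that when the "3" is parsed, it is appended to the p element:

html
  head
  body
    b
      #text: 1
    p
      b
        #text: 2
      #text: 3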

8.2.8.3 Unexpected markup in tables

This section is non-normative.

Error handling in tables is, for historical reasons, especially strange. For example, consider the following markup:

<table><b><tr><td>aaa</td></tr>bbb</table>ccc

The b element start tag is not allowed directly inside a table like that, and the parser handles this case by placing the element before the table. (This is called foster parenting.) This can be seen by examining the DOM tree as it stands just after the table element's start tag has been seen:

html
  head
  body
    table

...and then immediately after the b element start tag has been seen:

html
  head
  body
    b
    table

At this point, the stack of open elements has on it the elements html, body, table, and b (in that order, despite the resulting DOM tree); the list of active formatting elements just has the b element in it; and the insertion mode is "in table".

The tr start tag causes the b element to be popped off the stack and a tbody start tag to be implied; the tbody and tr elements are then handled in a rather straightforward manner, taking the parser through the "in table body" and "in row" insertion modes, after which the DOM looks as follows:

html
  head
  body
    b
    table
      tbody
        tr

Here, the stack of open elements has on it the elements html, body, table, tbody, and tr; the list of active formatting elements still has the b element in it; and the insertion mode is "in row".

The td element start tag token, after putting a td element on the tree, puts a marker on the list of active formatting elements (it also switches to the "in cell" insertion mode).

The marker means that when the "aaa" character tokens are seen, no b element is created to hold the resulting text node:

html
  head
  body
    b
    table
      tbody
        tr
          td
            #text: aaa

The end tags are handled in a straightforward manner; after handling them, the stack of open elements has on it the elements html, body, table, and tbody; the list of active formatting elements still has the b element in it (the marker having been removed by the "td" end tag token); and the insertion mode is "in table body".

Thus it is that the "bbb" character tokens are found. These trigger the "in table text" insertion mode to be used (with the original insertion mode set to "in table body"). The character tokens are collected, and when the next token (the table element end tag) is seen, they are processed as a group. Since they are not all spaces, they are handled as per the "anything else" rules in the "in table" insertion mode, which defer to the "in body" insertion mode but with foster parenting.

When the active formatting elements are reconstructed, a b element is created and foster parented, and then the "bbb" text node is appended to it:

html
  head
  body
    b
    b
      #text: bbb
    table
      tbody
        tr
          td
            #text: aaa

The stack of open elements has on it the elements html, body, table, tbody, and the new b (again, note that this doesn't match the resulting tree!); the list of active formatting elements has the new b element in it; and the insertion mode is still "in table body".

Had the character tokens been only space characters instead of "bbb", then those space characters would just be appended to the tbody element.

Finally, the table is closed by a "table" end tag. This pops all the nodes from the stack of open elements up to and including the table element, but it doesn't affect the list of active formatting elements, so the "ccc" character tokens after the table result in yet another b element being created, this time after the table:

html
  head
  body
    b
    b
      #text: bbb
    table
      tbody
        tr
          td
            #text: aaa
    b
      #text: ccc

8.2.8.4 Scripts that modify the page as it is being parsed

This section is non-normative.

Consider the following markup, which for this example we will assume is the document with URL http://example.com/inner, being rendered as the content of an iframe in another document with the URL http://example.com/outer:

<div id=a>
 <script>
  var div = document.getElementById('a');
  parent.document.body.appendChild(div);
 </script>
 <script>
  alert(document.URL);
 </script>
</div>
<script>
 alert(document.URL);
</script>

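Up to the first "script" end tag, before the script is parsed, the result is relatively straightforward (whitespace-only text nodes and the indentation inside text nodes are not shown):

html
  head
  body
    div id="a"
      script
        #text: var div = document.getElementById('a'); parent.document.body.appendChild(div);

After the script is parsed, though, the div element and its child script element are gone:

html
  head
  body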

They are, at this point, in the Document of the aforementioned outer browsing context. However, the stack of open elements still contains the div element.

Thus, when the second script element is parsed, it is inserted into the outer Document object.

This also means that the script's global object is the outer browsing context's Window object, not the Window object inside the iframe.

This isn't a security problem since the script that moves the div into the outer Document can only do so because the two Document objects have the same origin.

Thus, the first alert says "http://example.com/outer".

Once the div element's end tag is parsed, the div element is popped off the stack, and so the next script element is in the inner Document (whitespace-only text nodes are not shown):

html
  head
  body
    script
      #text: alert(document.URL);

This second alert will say "http://example.com/inner".

8.2.8.5 The execution of scripts that are moving across multiple documents

This section is non-normative.

Elaborating on the example in the previous section, consider a case where a script element with a src attribute is parsed, but while the external script is being downloaded, the element is moved to another document.

In this case, the script's global object is that second document's browsing context's Window object, not the Window object of the document into which the element was parsed.

8.2.8.6 Unclosed formatting elements

This section is non-normative.

The following markup shows how nested formatting elements (such as b) get collected and continue to be applied even as the elements they are contained in are closed, but that excessive duplicates are thrown away.

<!DOCTYPE html>
<p><b class=x><b class=x><b><b class=x><b class=x><b>X
<p>X
<p><b><b class=x><b>X
<p></b></b></b></b></b></b>X

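The resulting DOM tree is as follows (line breaks inside text nodes are not shown):

DOCTYPE: html
html
  head
  body
    p
      b class="x"
        b class="x"
          b
            b class="x"
              b class="x"
                b
                  #text: X
    p
      b class="x"
        b
          b class="x"
            b class="x"
              b
                #text: X
    p
      b class="x"
        b
          b class="x"
            b class="x"
              b
                b
                  b class="x"
                    b
                      #text: X
    p
      #text: X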

Note how the second p element in the markup has no explicit b elements, but in the resulting DOM, up to three of each kind of formatting element (in this case three b elements with the class attribute, and two unadorned b elements) get reconstructed before the element's "X".

Also note how this means that in the final paragraph only six b end tags are needed to completely clear the list of active formatting elements, even though nine b start tags have been seen up to this point.
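
The "up to three of each kind" limit can be illustrated by counting entries as the nine b start tags are pushed. This sketch is non-normative and simplified: the name `push_formatting` is hypothetical, entries are plain (tag, attributes) tuples, and markers and namespaces are not modelled.

```python
def push_formatting(active, tag, attrs=()):
    """Append an entry, keeping at most three identical ones.

    Entries are (tag, attrs) tuples; markers and namespaces are not modelled.
    """
    entry = (tag, tuple(sorted(attrs)))
    matches = [i for i, existing in enumerate(active) if existing == entry]
    if len(matches) >= 3:
        del active[matches[0]]  # drop the earliest identical entry
    active.append(entry)


active = []
X = [("class", "x")]

for attrs in (X, X, (), X, X, ()):  # start tags of the first paragraph
    push_formatting(active, "b", attrs)
after_first_paragraph = len(active)

for attrs in ((), X, ()):           # start tags of the third paragraph
    push_formatting(active, "b", attrs)

print(after_first_paragraph, len(active))  # 5 6
```

Five entries remain after the first paragraph (three with the class attribute, two without), and six after the third, which is why six end tags suffice.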

8.3 Serializing HTML fragments

The following steps form the HTML fragment serialization algorithm. The algorithm takes as input a DOM Element, Document, or DocumentFragment referred to as the node, and either returns a string or raises an exception.

This algorithm serializes the children of the node being serialized, not the node itself.

  1. Let s be a string, and initialize it to the empty string.

  2. For each child node of the node, in tree order, run the following steps:

    1. Let current node be the child node being processed.

    2. Append the appropriate string from the following list to s:

      If current node is an Element

      If current node is an element in the HTML namespace, the MathML namespace, or the SVG namespace, then let tagname be current node's local name. Otherwise, let tagname be current node's qualified name.

      Append a U+003C LESS-THAN SIGN character (<), followed by tagname.

      For HTML elements created by the HTML parser or Document.createElement(), tagname will be lowercase.

      For each attribute that the element has, append a U+0020 SPACE character, the attribute's serialized name as described below, a U+003D EQUALS SIGN character (=), a U+0022 QUOTATION MARK character ("), the attribute's value, escaped as described below in attribute mode, and a second U+0022 QUOTATION MARK character (").

      An attribute's serialized name for the purposes of the previous paragraph must be determined as follows:

      If the attribute has no namespace

      The attribute's serialized name is the attribute's local name.

      For attributes on HTML elements set by the HTML parser or by Element.setAttributeNode() or Element.setAttribute(), the local name will be lowercase.

      If the attribute is in the XML namespace

      The attribute's serialized name is the string "xml:" followed by the attribute's local name.

      If the attribute is in the XMLNS namespace and the attribute's local name is xmlns

      The attribute's serialized name is the string "xmlns".

      If the attribute is in the XMLNS namespace and the attribute's local name is not xmlns

      The attribute's serialized name is the string "xmlns:" followed by the attribute's local name.

      If the attribute is in the XLink namespace

      The attribute's serialized name is the string "xlink:" followed by the attribute's local name.

      If the attribute is in some other namespace

      The attribute's serialized name is the attribute's qualified name.

      While the exact order of attributes is UA-defined, and may depend on factors such as the order that the attributes were given in the original markup, the sort order must be stable, such that consecutive invocations of this algorithm serialize an element's attributes in the same order.

      Append a U+003E GREATER-THAN SIGN character (>).

      If current node is an area, base, basefont, bgsound, br, col, command, embed, frame, hr, img, input, keygen, link, meta, param, source, track or wbr element, then continue on to the next child node at this point.

      If current node is a pre, textarea, or listing element, append a U+000A LINE FEED (LF) character.

      Append the value of running the HTML fragment serialization algorithm on the current node element (thus recursing into this algorithm for that element), followed by a U+003C LESS-THAN SIGN character (<), a U+002F SOLIDUS character (/), tagname again, and finally a U+003E GREATER-THAN SIGN character (>).

      If current node is a Text or CDATASection node

      If the parent of current node is a style, script, xmp, iframe, noembed, noframes, or plaintext element, or if the parent of current node is a noscript element and scripting is enabled for the node, then append the value of current node's data IDL attribute literally.

      Otherwise, append the value of current node's data IDL attribute, escaped as described below.

      If current node is a Comment

      Append the literal string <!-- (U+003C LESS-THAN SIGN, U+0021 EXCLAMATION MARK, U+002D HYPHEN-MINUS, U+002D HYPHEN-MINUS), followed by the value of current node's data IDL attribute, followed by the literal string --> (U+002D HYPHEN-MINUS, U+002D HYPHEN-MINUS, U+003E GREATER-THAN SIGN).

      If current node is a ProcessingInstruction

      Append the literal string <? (U+003C LESS-THAN SIGN, U+003F QUESTION MARK), followed by the value of current node's target IDL attribute, followed by a single U+0020 SPACE character, followed by the value of current node's data IDL attribute, followed by a single U+003E GREATER-THAN SIGN character (>).

      If current node is a DocumentType

      Append the literal string <!DOCTYPE (U+003C LESS-THAN SIGN, U+0021 EXCLAMATION MARK, U+0044 LATIN CAPITAL LETTER D, U+004F LATIN CAPITAL LETTER O, U+0043 LATIN CAPITAL LETTER C, U+0054 LATIN CAPITAL LETTER T, U+0059 LATIN CAPITAL LETTER Y, U+0050 LATIN CAPITAL LETTER P, U+0045 LATIN CAPITAL LETTER E), followed by a space (U+0020 SPACE), followed by the value of current node's name IDL attribute, followed by the literal string > (U+003E GREATER-THAN SIGN).

      Other node types (e.g. Attr) cannot occur as children of elements. If, despite this, they somehow do occur, this algorithm must raise an INVALID_STATE_ERR exception.

  3. The result of the algorithm is the string s.
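
The rules above for an attribute's serialized name can be sketched as a lookup on the attribute's namespace. This sketch is non-normative. The function name and parameters are hypothetical, namespaces are represented as URI strings with None meaning "no namespace", and falling back to the local name when no qualified name is supplied is an assumption of this sketch only.

```python
XML_NS = "http://www.w3.org/XML/1998/namespace"
XMLNS_NS = "http://www.w3.org/2000/xmlns/"
XLINK_NS = "http://www.w3.org/1999/xlink"


def serialized_attribute_name(namespace, local_name, qualified_name=None):
    """Return the name used when serializing an attribute.

    namespace is a namespace URI string, or None for no namespace.
    """
    if namespace is None:
        return local_name
    if namespace == XML_NS:
        return "xml:" + local_name
    if namespace == XMLNS_NS:
        return "xmlns" if local_name == "xmlns" else "xmlns:" + local_name
    if namespace == XLINK_NS:
        return "xlink:" + local_name
    # Any other namespace: use the qualified name (assumed fallback to the
    # local name when the caller did not supply one).
    return qualified_name if qualified_name is not None else local_name


print(serialized_attribute_name(XLINK_NS, "href"))  # xlink:href
```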

Entity reference nodes are assumed to be expanded by the user agent, and are therefore not covered in the algorithm above.

It is possible that the output of this algorithm, if parsed with an HTML parser, will not return the original tree structure.

For instance, if a textarea element to which a Comment node has been appended is serialized and the output is then reparsed, the comment will end up being displayed in the text field. Similarly, if, as a result of DOM manipulation, an element contains a comment that contains the literal string "-->", then when the result of serializing the element is parsed, the comment will be truncated at that point and the rest of the comment will be interpreted as markup. More examples would be making a script element contain a text node with the text string "</script>", or having a p element that contains a ul element (as the ul element's start tag would imply the end tag for the p).

This can enable cross-site scripting attacks. An example of this would be a page that lets the user enter some font names that are then inserted into a CSS style block via the DOM and which then uses the innerHTML IDL attribute to get the HTML serialization of that style element: if the user enters "</style><script>attack</script>" as a font name, innerHTML will return markup that, if parsed in a different context, would contain a script node, even though no script node existed in the original DOM.
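
The font-name scenario above can be reproduced in miniature. This sketch is non-normative and rests on assumptions: `serialize_style` is a hypothetical stand-in for serializing a style element (its text is appended literally, as the algorithm requires), and the reparse uses Python's standard `html.parser` module, which is not an implementation of the parser in this specification and is used here only to show that a script start tag becomes visible.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Record the name of every start tag seen while parsing."""

    def __init__(self):
        super().__init__()
        self.start_tags = []

    def handle_starttag(self, tag, attrs):
        self.start_tags.append(tag)


def serialize_style(css_text):
    # Text inside a style element is appended literally, without escaping.
    return "<style>" + css_text + "</style>"


font_name = "</style><script>attack</script>"  # hostile user input
markup = serialize_style("p { font-family: " + font_name + "; }")

collector = TagCollector()
collector.feed(markup)
collector.close()
print(collector.start_tags)  # ['style', 'script']
```

The original DOM held a single style element with one text node, yet the reparsed markup contains a script start tag.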

Escaping a string (for the purposes of the algorithm above) consists of running the following steps:

  1. Replace any occurrence of the "&" character by the string "&amp;".

  2. Replace any occurrences of the U+00A0 NO-BREAK SPACE character by the string "&nbsp;".

  3. If the algorithm was invoked in the attribute mode, replace any occurrences of the U+0022 QUOTATION MARK character (") by the string "&quot;".

  4. If the algorithm was not invoked in the attribute mode, replace any occurrences of the "<" character by the string "&lt;", and any occurrences of the ">" character by the string "&gt;".
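
The four escaping steps above translate almost directly into code. This sketch is non-normative; the function name `escape` and the `attribute_mode` parameter are hypothetical names chosen for illustration.

```python
def escape(text, attribute_mode=False):
    """Escape text for serialization, following the four steps in order."""
    # Step 1 runs first so that the ampersands introduced by the later
    # replacements are not themselves escaped again.
    text = text.replace("&", "&amp;")
    text = text.replace("\u00A0", "&nbsp;")      # step 2
    if attribute_mode:
        text = text.replace('"', "&quot;")       # step 3
    else:
        text = text.replace("<", "&lt;")         # step 4
        text = text.replace(">", "&gt;")
    return text


print(escape("a < b & c"))                             # a &lt; b &amp; c
print(escape('say "hi" <now>', attribute_mode=True))   # say &quot;hi&quot; <now>
```

Note that in attribute mode the "<" and ">" characters are left as they are, and outside attribute mode quotation marks are left as they are.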

8.4 Parsing HTML fragments

The following steps form the HTML fragment parsing algorithm. The algorithm optionally takes as input an Element node, referred to as the context element, which gives the context for the parser, as well as input, a string to parse, and returns a list of zero or more nodes.

Parts marked fragment case in algorithms in the parser section are parts that only occur if the parser was created for the purposes of this algorithm (and with a context element). The algorithms have been annotated with such markings for informational purposes only; such markings have no normative weight. If it is possible for a condition described as a fragment case to occur even when the parser wasn't created for the purposes of handling this algorithm, then that is an error in the specification.

  1. Create a new Document node, and mark it as being an HTML document.

  2. If there is a context element, and the Document of the context element is in quirks mode, then let the Document be in quirks mode. Otherwise, if there is a context element, and the Document of the context element is in limited-quirks mode, then let the Document be in limited-quirks mode. Otherwise, leave the Document in no-quirks mode.

  3. Create a new HTML parser, and associate it with the just created Document node.

  4. If there is a context element, run these substeps:

    1. Set the state of the HTML parser's tokenization stage as follows:

      If it is a title or textarea element
        Switch the tokenizer to the RCDATA state.
      If it is a style, xmp, iframe, noembed, or noframes element
        Switch the tokenizer to the RAWTEXT state.
      If it is a script element
        Switch the tokenizer to the script data state.
      If it is a noscript element
        If the scripting flag is enabled, switch the tokenizer to the RAWTEXT state. Otherwise, leave the tokenizer in the data state.
      If it is a plaintext element
        Switch the tokenizer to the PLAINTEXT state.
      Otherwise
        Leave the tokenizer in the data state.

      For performance reasons, an implementation that does not report errors and that uses the actual state machine described in this specification directly could use the PLAINTEXT state instead of the RAWTEXT and script data states where those are mentioned in the list above. Except for rules regarding parse errors, they are equivalent, since there is no appropriate end tag token in the fragment case, yet they involve far fewer state transitions.

    2. Let root be a new html element with no attributes.

    3. Append the element root to the Document node created above.

    4. Set up the parser's stack of open elements so that it contains just the single element root.

    5. Reset the parser's insertion mode appropriately.

      The parser will reference the context element as part of that algorithm.

    6. Set the parser's form element pointer to the nearest node to the context element that is a form element (going straight up the ancestor chain, and including the element itself, if it is a form element), or, if there is no such form element, to null.

  5. Place into the input stream for the HTML parser just created the input. The encoding confidence is irrelevant.

  6. Start the parser and let it run until it has consumed all the characters just inserted into the input stream.

  7. If there is a context element, return the child nodes of root, in tree order.

    Otherwise, return the children of the Document object, in tree order.
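
The choice of initial tokenizer state in step 4 of the algorithm above can be sketched as a simple lookup on the context element. This sketch is non-normative; the function name, its parameters, and the use of plain strings for tokenizer states are assumptions made for illustration.

```python
RCDATA_ELEMENTS = {"title", "textarea"}
RAWTEXT_ELEMENTS = {"style", "xmp", "iframe", "noembed", "noframes"}


def initial_tokenizer_state(context_tag=None, scripting_enabled=True):
    """Pick the tokenizer state for a fragment parse.

    context_tag is the context element's tag name, or None when the
    algorithm is run without a context element.
    """
    if context_tag in RCDATA_ELEMENTS:
        return "RCDATA"
    if context_tag in RAWTEXT_ELEMENTS:
        return "RAWTEXT"
    if context_tag == "script":
        return "script data"
    if context_tag == "noscript":
        # Depends on the scripting flag.
        return "RAWTEXT" if scripting_enabled else "data"
    if context_tag == "plaintext":
        return "PLAINTEXT"
    return "data"


print(initial_tokenizer_state("textarea"))  # RCDATA
print(initial_tokenizer_state("div"))       # data
```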