Once the user agent stops parsing the document, the user agent must run the following steps:
Set the current document readiness to "interactive" and the insertion point to undefined.
Pop all the nodes off the stack of open elements.
If the list of scripts that will execute when the document has finished parsing is not empty, run these substeps:
Spin the event loop until the first script in the list of scripts that will execute when the document has finished parsing has its "ready to be parser-executed" flag set and the parser's Document has no style sheet that is blocking scripts.
Execute the first script in the list of scripts that will execute when the document has finished parsing.
Remove the first script element from the list of scripts that will execute when the document has finished parsing (i.e. shift out the first entry in the list).
If the list of scripts that will execute when the document has finished parsing is still not empty, repeat these substeps again from substep 1.
Queue a task to fire a simple event that bubbles named DOMContentLoaded at the Document.
Spin the event loop until the set of scripts that will execute as soon as possible and the list of scripts that will execute in order as soon as possible are empty.
Spin the event loop until there is nothing that delays the load event in the Document.
Queue a task to set the current document readiness to "complete".
If the Document is in a browsing context, then queue a task to fire a simple event named load at the Document's Window object, but with its target set to the Document object (and the currentTarget set to the Window object).
If the Document is in a browsing context, then queue a task to fire a pageshow event at the Window object of the Document, but with its target set to the Document object (and the currentTarget set to the Window object), using the PageTransitionEvent interface, with the persisted attribute set to false. This event must not bubble, must not be cancelable, and has no default action.
If the Document has any pending application cache download process tasks, then queue each such task in the order they were added to the list of pending application cache download process tasks, and then empty the list of pending application cache download process tasks. The task source for these tasks is the networking task source.
The Document is now ready for post-load tasks.
Queue a task to mark the Document as completely loaded.
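The ordering imposed by these steps can be illustrated with a small simulation. This is only a sketch under stated assumptions, not how a user agent is built: the `Doc` class, its `log` list, and the use of plain callables as stand-ins for deferred scripts are all hypothetical, and the event-loop waits, the browsing-context check, and the application cache tasks are left out.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Callable, Deque, List


@dataclass
class Doc:
    """Hypothetical stand-in for a Document; tracks only what the steps touch."""
    readiness: str = "loading"
    deferred_scripts: Deque[Callable[[], None]] = field(default_factory=deque)
    log: List[str] = field(default_factory=list)


def stop_parsing(doc: Doc) -> None:
    # Readiness becomes "interactive" before any deferred script runs.
    doc.readiness = "interactive"
    doc.log.append("readiness:interactive")
    # Execute the first script, then shift it out of the list; repeat until empty.
    while doc.deferred_scripts:
        doc.deferred_scripts[0]()
        doc.deferred_scripts.popleft()
    # DOMContentLoaded is fired only after all of those scripts have run.
    doc.log.append("event:DOMContentLoaded")
    # (Waiting for other scripts and for load-delaying resources is omitted.)
    doc.readiness = "complete"
    doc.log.append("readiness:complete")
    # load comes before pageshow.
    doc.log.append("event:load")
    doc.log.append("event:pageshow")
```

A caller would append callables to `doc.deferred_scripts`, call `stop_parsing(doc)`, and then inspect `doc.log` to see the relative order of readiness changes, script execution, and events.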
When the user agent is to abort a parser, it must run the following steps:
Throw away any pending content in the input stream, and discard any future content that would have been added to it.
Pop all the nodes off the stack of open elements.
Except where otherwise specified, the task source for the tasks mentioned in this section is the DOM manipulation task source.
When an application uses an HTML parser in conjunction with an XML pipeline, it is possible that the constructed DOM is not compatible with the XML toolchain in certain subtle ways. For example, an XML toolchain might not be able to represent attributes with the name xmlns, since they conflict with the Namespaces in XML syntax. There is also some data that the HTML parser generates that isn't included in the DOM itself. This section specifies some rules for handling these issues.
If the XML API being used doesn't support DOCTYPEs, the tool may drop DOCTYPEs altogether.
If the XML API doesn't support attributes in no namespace that are named "xmlns", attributes whose names start with "xmlns:", or attributes in the XMLNS namespace, then the tool may drop such attributes.
The tool may annotate the output with any namespace declarations required for proper operation.
If the XML API being used restricts the allowable characters in the local names of elements and attributes, then the tool may map all element and attribute local names that the API wouldn't support to a set of names that are allowed, by replacing any character that isn't supported with the uppercase letter U and the six digits of the character's Unicode code point when expressed in hexadecimal, using digits 0-9 and capital letters A-F as the symbols, in increasing numeric order.
For example, the element name foo<bar, which can be output by the HTML parser, though it is neither a legal HTML element name nor a well-formed XML element name, would be converted into fooU00003Cbar, which is a well-formed XML element name (though it's still not legal in HTML by any means).
As another example, consider the attribute xlink:href. Used on a MathML element, it becomes, after being adjusted, an attribute with a prefix "xlink" and a local name "href". However, used on an HTML element, it becomes an attribute with no prefix and the local name "xlink:href", which is not a valid NCName, and thus might not be accepted by an XML API. It could thus get converted, becoming "xlinkU00003Ahref".
The resulting names from this conversion conveniently can't clash with any attribute generated by the HTML parser, since those are all either lowercase or those listed in the adjust foreign attributes algorithm's table.
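The name mapping described above can be sketched as follows. The set of characters that the XML API accepts is an assumption here: the sketch allows only an ASCII subset of NCName characters, whereas a real API would accept many more code points. The function name `coerce_local_name` is hypothetical.

```python
import re

# Assumption: the XML API accepts only these ASCII characters in local names.
# The real XML name rules are considerably broader.
_NAME_START = re.compile(r"[A-Za-z_]")
_NAME_CHAR = re.compile(r"[A-Za-z0-9_.\-]")


def coerce_local_name(name: str) -> str:
    """Replace each unsupported character with 'U' plus six uppercase hex digits."""
    out = []
    for index, ch in enumerate(name):
        allowed = _NAME_START if index == 0 else _NAME_CHAR
        if allowed.fullmatch(ch):
            out.append(ch)
        else:
            # Six hexadecimal digits of the code point, using 0-9 and A-F.
            out.append("U%06X" % ord(ch))
    return "".join(out)
```

With this character set, `coerce_local_name("foo<bar")` gives `fooU00003Cbar` and `coerce_local_name("xlink:href")` gives `xlinkU00003Ahref`, matching the two examples in the text.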
If the XML API restricts comments from having two consecutive U+002D HYPHEN-MINUS characters (--), the tool may insert a single U+0020 SPACE character between any such offending characters.
If the XML API restricts comments from ending in a U+002D HYPHEN-MINUS character (-), the tool may insert a single U+0020 SPACE character at the end of such comments.
If the XML API restricts allowed characters in character data, attribute values, or comments, the tool may replace any U+000C FORM FEED (FF) character with a U+0020 SPACE character, and any other literal non-XML character with a U+FFFD REPLACEMENT CHARACTER.
If the tool has no way to convey out-of-band information, then the tool may drop the following information:
form element ancestor (use of the form element pointer in the parser)
The mutations allowed by this section apply after the HTML parser's rules have been applied. For example, a <a::> start tag will be closed by a </a::> end tag, and never by a </aU00003AU00003A> end tag, even if the user agent is using the rules above to then generate an actual element in the DOM with the name aU00003AU00003A for that start tag.
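The rules earlier in this section for comments and for restricted characters can be sketched in the same way. The sketch assumes that "non-XML character" means any code point outside the XML 1.0 Char production; the function names are hypothetical.

```python
def coerce_comment(data: str) -> str:
    """Break up '--' runs and avoid a trailing '-' in comment data."""
    # Repeat until no "--" remains, so that "---" becomes "- - -".
    while "--" in data:
        data = data.replace("--", "- -")
    if data.endswith("-"):
        data += " "
    return data


def is_xml_char(ch: str) -> bool:
    """True if ch is allowed by the XML 1.0 Char production (an assumption)."""
    cp = ord(ch)
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def coerce_text(data: str) -> str:
    """Map FORM FEED to SPACE and any other non-XML character to U+FFFD."""
    out = []
    for ch in data:
        if ch == "\x0c":
            out.append(" ")
        elif is_xml_char(ch):
            out.append(ch)
        else:
            out.append("\ufffd")
    return "".join(out)
```

Like the name mapping, these functions would run on the tree the parser has already produced; they do not change how the parser itself matches tags.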
This section is non-normative.
This section examines some erroneous markup and discusses how the HTML parser handles these cases.
This section is non-normative.
The most-often discussed example of erroneous markup is as follows:
<p>1<b>2<i>3</b>4</i>5</p>
The parsing of this markup is straightforward up to the "3". At this point, the DOM looks like this:
Here, the stack of open elements has five elements on it: html, body, p, b, and i. The list of active formatting elements just has two: b and i. The insertion mode is "in body".
Upon receiving the end tag token with the tag name "b", the "adoption agency algorithm" is invoked. This is a simple case, in that the formatting element is the b element, and there is no furthest block. Thus, the stack of open elements ends up with just three elements: html, body, and p, while the list of active formatting elements has just one: i. The DOM tree is unmodified at this point.
The next token is a character ("4"), which triggers the reconstruction of the active formatting elements, in this case just the i element. A new i element is thus created for the "4" text node. After the end tag token for the "i" is also received, and the "5" text node is inserted, the DOM looks as follows:
This section is non-normative.
A case similar to the previous one is the following:
<b>1<p>2</b>3</p>
Up to the "2" the parsing here is straightforward:
The interesting part is when the end tag token with the tag name "b" is parsed.
Before that token is seen, the stack of open elements has four elements on it: html, body, b, and p. The list of active formatting elements just has the one: b. The insertion mode is "in body".
Upon receiving the end tag token with the tag name "b", the "adoption agency algorithm" is invoked, as in the previous example. However, in this case, there is a furthest block, namely the p element. Thus, this time the adoption agency algorithm isn't skipped over.
The common ancestor is the body element. A conceptual "bookmark" marks the position of the b in the list of active formatting elements, but since that list has only one element in it, the bookmark won't have much effect.
As the algorithm progresses, node ends up set to the formatting element (b), and last node ends up set to the furthest block (p).
The last node gets appended (moved) to the common ancestor, so that the DOM looks like:
A new b element is created, and the children of the p element are moved to it:
b
  #text: 2
Finally, the new b element is appended to the p element, so that the DOM looks like:
The b element is removed from the list of active formatting elements and the stack of open elements, so that when the "3" is parsed, it is appended to the p element:
This section is non-normative.
Error handling in tables is, for historical reasons, especially strange. For example, consider the following markup:
<table><b><tr><td>aaa</td></tr>bbb</table>ccc
The b element start tag is not allowed directly inside a table like that, and the parser handles this case by placing the element before the table. (This is called foster parenting.) This can be seen by examining the DOM tree as it stands just after the table element's start tag has been seen:
...and then immediately after the b element start tag has been seen:
At this point, the stack of open elements has on it the elements html, body, table, and b (in that order, despite the resulting DOM tree); the list of active formatting elements just has the b element in it; and the insertion mode is "in table".
The tr start tag causes the b element to be popped off the stack and a tbody start tag to be implied; the tbody and tr elements are then handled in a rather straightforward manner, taking the parser through the "in table body" and "in row" insertion modes, after which the DOM looks as follows:
Here, the stack of open elements has on it the elements html, body, table, tbody, and tr; the list of active formatting elements still has the b element in it; and the insertion mode is "in row".
The td element start tag token, after putting a td element on the tree, puts a marker on the list of active formatting elements (it also switches to the "in cell" insertion mode).
The marker means that when the "aaa" character tokens are seen, no b element is created to hold the resulting text node:
The end tags are handled in a straightforward manner; after handling them, the stack of open elements has on it the elements html, body, table, and tbody; the list of active formatting elements still has the b element in it (the marker having been removed by the "td" end tag token); and the insertion mode is "in table body".
Thus it is that the "bbb" character tokens are found. These trigger the "in table text" insertion mode to be used (with the original insertion mode set to "in table body"). The character tokens are collected, and when the next token (the table element end tag) is seen, they are processed as a group. Since they are not all spaces, they are handled as per the "anything else" rules in the "in table" insertion mode, which defer to the "in body" insertion mode but with foster parenting.
When the active formatting elements are reconstructed, a b element is created and foster parented, and then the "bbb" text node is appended to it:
The stack of open elements has on it the elements html, body, table, tbody, and the new b (again, note that this doesn't match the resulting tree!); the list of active formatting elements has the new b element in it; and the insertion mode is still "in table body".
Had the character tokens been only space characters instead of "bbb", then those space characters would just be appended to the tbody element.
Finally, the table is closed by a "table" end tag. This pops all the nodes from the stack of open elements up to and including the table element, but it doesn't affect the list of active formatting elements, so the "ccc" character tokens after the table result in yet another b element being created, this time after the table:
This section is non-normative.
Consider the following markup, which for this example we will assume is the document with URL http://example.com/inner, being rendered as the content of an iframe in another document with the URL http://example.com/outer:
<div id=a>
 <script>
  var div = document.getElementById('a');
  parent.document.body.appendChild(div);
 </script>
 <script>
  alert(document.URL);
 </script>
</div>
<script>
 alert(document.URL);
</script>
Up to the first "script" end tag, before the script is parsed, the result is relatively straightforward:
After the script is parsed, though, the div element and its child script element are gone:
They are, at this point, in the Document of the aforementioned outer browsing context. However, the stack of open elements still contains the div element.
Thus, when the second script element is parsed, it is inserted into the outer Document object.
This also means that the script's global object is the outer browsing context's Window object, not the Window object inside the iframe.
This isn't a security problem since the script that moves the div into the outer Document can only do so because the two Document objects have the same origin.
Thus, the first alert says "http://example.com/outer".
Once the div element's end tag is parsed, the div element is popped off the stack, and so the next script element is in the inner Document:
This second alert will say "http://example.com/inner".
This section is non-normative.
Elaborating on the example in the previous section, consider a case where a script element with a src attribute is parsed, but while the external script is being downloaded, the element is moved to another document.
In this case, the script's global object is that second document's browsing context's Window object, not the Window object of the document into which the element was parsed.
This section is non-normative.
The following markup shows how nested formatting elements (such as b) get collected and continue to be applied even as the elements they are contained in are closed, but that excessive duplicates are thrown away.
<!DOCTYPE html>
<p><b class=x><b class=x><b><b class=x><b class=x><b>X
<p>X
<p><b><b class=x><b>X
<p></b></b></b></b></b></b>X
The resulting DOM tree is as follows:
Note how the second p element in the markup has no explicit b elements, but in the resulting DOM, up to three of each kind of formatting element (in this case three b elements with the class attribute, and two unadorned b elements) get reconstructed before the element's "X".
Also note how this means that in the final paragraph only six b end tags are needed to completely clear the list of formatting elements, even though nine b start tags have been seen up to this point.
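The counts in this example can be reproduced with a short sketch of how entries might be pushed onto the list of active formatting elements. It is a simplification under stated assumptions: entries are compared by tag name and attribute name/value pairs only, markers on the list are ignored (none occur in this markup), and the names `push_formatting` and `Entry` are hypothetical.

```python
from typing import Dict, List, Tuple

# (tag name, sorted attribute pairs); a hypothetical representation of an entry.
Entry = Tuple[str, Tuple[Tuple[str, str], ...]]


def push_formatting(active: List[Entry], tag: str, attrs: Dict[str, str]) -> None:
    """Append an entry, first discarding the earliest of three identical ones."""
    entry: Entry = (tag, tuple(sorted(attrs.items())))
    matches = [i for i, existing in enumerate(active) if existing == entry]
    if len(matches) >= 3:
        del active[matches[0]]  # keep at most three identical entries
    active.append(entry)


active: List[Entry] = []
x = {"class": "x"}
# The six b start tags of the first paragraph.
for attrs in (x, x, {}, x, x, {}):
    push_formatting(active, "b", attrs)
after_first_paragraph = len(active)
# The three b start tags of the third paragraph.
for attrs in ({}, x, {}):
    push_formatting(active, "b", attrs)
after_third_paragraph = len(active)
```

Under these assumptions the list holds five entries after the first paragraph (three with the class attribute, two without) and six after the third, which is why six end tags are enough to clear it.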
The following steps form the HTML fragment serialization algorithm. The algorithm takes as input a DOM Element, Document, or DocumentFragment referred to as the node, and either returns a string or raises an exception.
This algorithm serializes the children of the node being serialized, not the node itself.
Let s be a string, and initialize it to the empty string.
For each child node of the node, in tree order, run the following steps:
Let current node be the child node being processed.
Append the appropriate string from the following list to s:
Element
If current node is an element in the HTML namespace, the MathML namespace, or the SVG namespace, then let tagname be current node's local name. Otherwise, let tagname be current node's qualified name.
Append a U+003C LESS-THAN SIGN character (<), followed by tagname.
For HTML elements created by the HTML parser or Document.createElement(), tagname will be lowercase.
For each attribute that the element has, append a U+0020 SPACE character, the attribute's serialized name as described below, a U+003D EQUALS SIGN character (=), a U+0022 QUOTATION MARK character ("), the attribute's value, escaped as described below in attribute mode, and a second U+0022 QUOTATION MARK character (").
An attribute's serialized name for the purposes of the previous paragraph must be determined as follows:
If the attribute has no namespace
The attribute's serialized name is the attribute's local name.
For attributes on HTML elements set by the HTML parser or by Element.setAttributeNode() or Element.setAttribute(), the local name will be lowercase.
If the attribute is in the XML namespace
The attribute's serialized name is the string "xml:" followed by the attribute's local name.
If the attribute is in the XMLNS namespace and the attribute's local name is xmlns
The attribute's serialized name is the string "xmlns".
If the attribute is in the XMLNS namespace and the attribute's local name is not xmlns
The attribute's serialized name is the string "xmlns:" followed by the attribute's local name.
If the attribute is in the XLink namespace
The attribute's serialized name is the string "xlink:" followed by the attribute's local name.
If the attribute is in some other namespace
The attribute's serialized name is the attribute's qualified name.
While the exact order of attributes is UA-defined, and may depend on factors such as the order that the attributes were given in the original markup, the sort order must be stable, such that consecutive invocations of this algorithm serialize an element's attributes in the same order.
Append a U+003E GREATER-THAN SIGN character (>).
If current node is an area, base, basefont, bgsound, br, col, command, embed, frame, hr, img, input, keygen, link, meta, param, source, track or wbr element, then continue on to the next child node at this point.
If current node is a pre, textarea, or listing element, append a U+000A LINE FEED (LF) character.
Append the value of running the HTML fragment serialization algorithm on the current node element (thus recursing into this algorithm for that element), followed by a U+003C LESS-THAN SIGN character (<), a U+002F SOLIDUS character (/), tagname again, and finally a U+003E GREATER-THAN SIGN character (>).
Text or CDATASection node
If the parent of current node is a style, script, xmp, iframe, noembed, noframes, or plaintext element, or if the parent of current node is a noscript element and scripting is enabled for the node, then append the value of current node's data IDL attribute literally.
Otherwise, append the value of current node's data IDL attribute, escaped as described below.
Comment
Append the literal string <!-- (U+003C LESS-THAN SIGN, U+0021 EXCLAMATION MARK, U+002D HYPHEN-MINUS, U+002D HYPHEN-MINUS), followed by the value of current node's data IDL attribute, followed by the literal string --> (U+002D HYPHEN-MINUS, U+002D HYPHEN-MINUS, U+003E GREATER-THAN SIGN).
ProcessingInstruction
Append the literal string <? (U+003C LESS-THAN SIGN, U+003F QUESTION MARK), followed by the value of current node's target IDL attribute, followed by a single U+0020 SPACE character, followed by the value of current node's data IDL attribute, followed by a single U+003E GREATER-THAN SIGN character (>).
DocumentType
Append the literal string <!DOCTYPE (U+003C LESS-THAN SIGN, U+0021 EXCLAMATION MARK, U+0044 LATIN CAPITAL LETTER D, U+004F LATIN CAPITAL LETTER O, U+0043 LATIN CAPITAL LETTER C, U+0054 LATIN CAPITAL LETTER T, U+0059 LATIN CAPITAL LETTER Y, U+0050 LATIN CAPITAL LETTER P, U+0045 LATIN CAPITAL LETTER E), followed by a space (U+0020 SPACE), followed by the value of current node's name IDL attribute, followed by the literal string > (U+003E GREATER-THAN SIGN).
Other node types (e.g. Attr) cannot occur as children of elements. If, despite this, they somehow do occur, this algorithm must raise an INVALID_STATE_ERR exception.
The result of the algorithm is the string s.
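The steps above, together with the escaping rules defined later in this section, can be sketched on a toy tree. Everything about the node model is an assumption made for illustration: `Element`, `Text`, and `Comment` are hypothetical classes rather than DOM interfaces, attributes are plain (name, value) pairs with no namespaces, and the noscript, ProcessingInstruction, and DocumentType cases are omitted.

```python
from dataclasses import dataclass, field
from typing import List, Tuple, Union

VOID = {"area", "base", "basefont", "bgsound", "br", "col", "command", "embed",
        "frame", "hr", "img", "input", "keygen", "link", "meta", "param",
        "source", "track", "wbr"}
RAW_TEXT_PARENTS = {"style", "script", "xmp", "iframe", "noembed", "noframes",
                    "plaintext"}


@dataclass
class Text:
    data: str


@dataclass
class Comment:
    data: str


@dataclass
class Element:
    name: str
    attrs: List[Tuple[str, str]] = field(default_factory=list)
    children: List["Node"] = field(default_factory=list)


Node = Union[Text, Comment, Element]


def escape(s: str, attribute_mode: bool = False) -> str:
    # "&" must be replaced first so later replacements are not re-escaped.
    s = s.replace("&", "&amp;").replace("\u00a0", "&nbsp;")
    if attribute_mode:
        return s.replace('"', "&quot;")
    return s.replace("<", "&lt;").replace(">", "&gt;")


def serialize_children(node: Element) -> str:
    """Serialize the children of node, not node itself."""
    s = ""
    for child in node.children:
        if isinstance(child, Element):
            s += "<" + child.name
            for name, value in child.attrs:
                s += ' %s="%s"' % (name, escape(value, attribute_mode=True))
            s += ">"
            if child.name in VOID:
                continue  # void elements get no content and no end tag
            if child.name in ("pre", "textarea", "listing"):
                s += "\n"
            s += serialize_children(child) + "</" + child.name + ">"
        elif isinstance(child, Text):
            raw = node.name in RAW_TEXT_PARENTS
            s += child.data if raw else escape(child.data)
        elif isinstance(child, Comment):
            s += "<!--" + child.data + "-->"
        else:
            raise ValueError("unexpected node type")
    return s
```

Note that a `Comment` whose data contains `-->` is written out unchanged, so the output of this sketch has the same round-tripping limitations as the algorithm it follows.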
Entity reference nodes are assumed to be expanded by the user agent, and are therefore not covered in the algorithm above.
It is possible that the output of this algorithm, if parsed with an HTML parser, will not return the original tree structure.
For instance, if a textarea element to which a Comment node has been appended is serialized and the output is then reparsed, the comment will end up being displayed in the text field. Similarly, if, as a result of DOM manipulation, an element contains a comment that contains the literal string "-->", then when the result of serializing the element is parsed, the comment will be truncated at that point and the rest of the comment will be interpreted as markup. More examples would be making a script element contain a text node with the text string "</script>", or having a p element that contains a ul element (as the ul element's start tag would imply the end tag for the p).
This can enable cross-site scripting attacks. An example of this would be a page that lets the user enter some font names that are then inserted into a CSS style block via the DOM and which then uses the innerHTML IDL attribute to get the HTML serialization of that style element: if the user enters "</style><script>attack</script>" as a font name, innerHTML will return markup that, if parsed in a different context, would contain a script node, even though no script node existed in the original DOM.
Escaping a string (for the purposes of the algorithm above) consists of running the following steps:
Replace any occurrence of the "&" character by the string "&amp;".
Replace any occurrences of the U+00A0 NO-BREAK SPACE character by the string "&nbsp;".
If the algorithm was invoked in the attribute mode, replace any occurrences of the """ character by the string "&quot;".
If the algorithm was not invoked in the attribute mode, replace any occurrences of the "<" character by the string "&lt;", and any occurrences of the ">" character by the string "&gt;".
The following steps form the HTML fragment parsing algorithm. The algorithm optionally takes as input an Element node, referred to as the context element, which gives the context for the parser, as well as input, a string to parse, and returns a list of zero or more nodes.
Parts marked fragment case in algorithms in the parser section are parts that only occur if the parser was created for the purposes of this algorithm (and with a context element). The algorithms have been annotated with such markings for informational purposes only; such markings have no normative weight. If it is possible for a condition described as a fragment case to occur even when the parser wasn't created for the purposes of handling this algorithm, then that is an error in the specification.
Create a new Document node, and mark it as being an HTML document.
If there is a context element, and the Document of the context element is in quirks mode, then let the Document be in quirks mode. Otherwise, if there is a context element, and the Document of the context element is in limited-quirks mode, then let the Document be in limited-quirks mode. Otherwise, leave the Document in no-quirks mode.
Create a new HTML parser, and associate it with the just created Document node.
If there is a context element, run these substeps:
Set the state of the HTML parser's tokenization stage as follows:
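title or textarea element
Switch the tokenizer to the RCDATA state.
style, xmp, iframe, noembed, or noframes element
Switch the tokenizer to the RAWTEXT state.
script element
Switch the tokenizer to the script data state.
noscript element
If the scripting flag is enabled, switch the tokenizer to the RAWTEXT state. Otherwise, leave the tokenizer in the data state.
plaintext element
Switch the tokenizer to the PLAINTEXT state.
Otherwise
Leave the tokenizer in the data state.
For performance reasons, an implementation that does not report errors and that uses the actual state machine described in this specification directly could use the PLAINTEXT state instead of the RAWTEXT and script data states where those are mentioned in the list above. Except for rules regarding parse errors, they are equivalent, since there is no appropriate end tag token in the fragment case, yet they involve far fewer state transitions.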
Let root be a new html element with no attributes.
Append the element root to the Document node created above.
Set up the parser's stack of open elements so that it contains just the single element root.
Reset the parser's insertion mode appropriately.
The parser will reference the context element as part of that algorithm.
Set the parser's form element pointer to the nearest node to the context element that is a form element (going straight up the ancestor chain, and including the element itself, if it is a form element), or, if there is no such form element, to null.
Place into the input stream for the HTML parser just created the input. The encoding confidence is irrelevant.
Start the parser and let it run until it has consumed all the characters just inserted into the input stream.
If there is a context element, return the child nodes of root, in tree order.
Otherwise, return the children of the Document object, in tree order.
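Two of the setup substeps, choosing the tokenizer's initial state from the context element and locating the form element pointer, can be sketched as below. The `Element` class is a hypothetical stand-in with only a name and a parent, the state names are plain strings, and the mapping from element names to states follows the HTML tokenizer as understood here, so it should be checked against the list in the substeps above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Element:
    """Hypothetical minimal element: a local name and an optional parent."""
    name: str
    parent: Optional["Element"] = None


def initial_tokenizer_state(context: Optional[Element],
                            scripting_enabled: bool) -> str:
    """Pick the tokenizer state the fragment parser would start in."""
    if context is None:
        return "data"
    name = context.name
    if name in ("title", "textarea"):
        return "RCDATA"
    if name in ("style", "xmp", "iframe", "noembed", "noframes"):
        return "RAWTEXT"
    if name == "script":
        return "script data"
    if name == "noscript":
        return "RAWTEXT" if scripting_enabled else "data"
    if name == "plaintext":
        return "PLAINTEXT"
    return "data"


def form_element_pointer(context: Optional[Element]) -> Optional[Element]:
    """Walk up from the context element (inclusive) to the nearest form."""
    node = context
    while node is not None:
        if node.name == "form":
            return node
        node = node.parent
    return None
```

For an input element nested inside a div inside a form, `form_element_pointer` returns that form; for a context element with no form ancestor it returns `None`, corresponding to the null case in the text.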