This is a work in progress! For the latest updates from the HTML WG, possibly including important bug fixes, please look at the editor's draft instead.
This section only describes the rules for XML
resources. Rules for
text/html resources are discussed
in the section above entitled "The HTML syntax".
The syntax for using HTML with XML, whether in XHTML documents or embedded in other XML documents, is defined in the XML and Namespaces in XML specifications. [XML] [XMLNS]
This specification does not define any syntax-level requirements beyond those defined for XML proper.
XML documents may contain a
DOCTYPE if desired, but
this is not required to conform to this specification. This
specification does not define a public or system identifier, nor
provide a formal DTD.
According to the XML specification, XML processors
are not guaranteed to process the external DTD subset referenced in
the DOCTYPE. This means, for example, that using entity references
for characters in XHTML documents is unsafe if they are defined in
an external file (except for
This section describes the relationship between XML and the DOM, with a particular emphasis on how this interacts with HTML.
An XML parser, for the purposes of this specification,
is a construct that follows the rules given in the XML specification
to map a string of bytes or characters into a Document object. This
Document must then be populated with DOM nodes
that represent the tree structure of the input passed to the parser,
as defined by the XML specification, the Namespaces in XML
specification, and the DOM Core specification. DOM mutation events
must not fire for the operations that the XML parser
performs on the
Document's tree, but the user agent
must act as if elements and attributes were individually appended
and set respectively so as to trigger rules in this specification
regarding what happens when an element is inserted into a document
or has its attributes set. [XML] [XMLNS] [DOMCORE]
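The following non-normative sketch illustrates the mapping from a string of characters to a populated Document tree. It assumes Python's standard `xml.dom.minidom` module as a stand-in for a user agent's XML parser; the `SOURCE` string and the `outline()` helper are hypothetical names introduced only for this illustration.

```python
from xml.dom import minidom

XHTML_NS = "http://www.w3.org/1999/xhtml"

# Hypothetical input used only for this illustration.
SOURCE = (
    '<html xmlns="http://www.w3.org/1999/xhtml">'
    '<body><p id="x">Hi</p></body></html>'
)


def outline(node, depth=0):
    """Return one indented line per element node, in tree order."""
    lines = []
    if node.nodeType == node.ELEMENT_NODE:
        lines.append("  " * depth + node.localName)
        depth += 1
    for child in node.childNodes:
        lines.extend(outline(child, depth))
    return lines


# The parser turns the character string into a Document populated with nodes.
document = minidom.parseString(SOURCE)
tree_outline = outline(document)

# Namespace information from the input is reflected on the DOM nodes.
paragraph = document.getElementsByTagNameNS(XHTML_NS, "p")[0]
```

This sketch does not model the requirement that the user agent act as if elements and attributes were individually appended and set; it only shows the resulting tree.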
Between the time an element's start tag is parsed and the time either the element's end tag is parsed or the parser detects a well-formedness error, the user agent must act as if the element was in a stack of open elements.
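As a non-normative illustration, the stack of open elements can be modelled with an event-driven parser: push on each start tag, pop on each end tag, and empty the stack when a well-formedness error is detected. The sketch below assumes Python's standard `xml.sax` module; `OpenElementTracker` and `trace_open_elements()` are hypothetical names, not part of any specification.

```python
import xml.sax


class OpenElementTracker(xml.sax.ContentHandler):
    """Records the stack of open elements as start and end tags are parsed."""

    def __init__(self):
        super().__init__()
        self.open_elements = []
        self.snapshots = []  # copy of the stack taken after each start tag

    def startElement(self, name, attrs):
        self.open_elements.append(name)
        self.snapshots.append(list(self.open_elements))

    def endElement(self, name):
        self.open_elements.pop()


def trace_open_elements(text):
    tracker = OpenElementTracker()
    try:
        xml.sax.parseString(text.encode("utf-8"), tracker)
    except xml.sax.SAXParseException:
        # Once a well-formedness error is detected, no element is
        # treated as open any longer.
        tracker.open_elements.clear()
    return tracker


well_formed = trace_open_elements("<a><b/><c><d/></c></a>")
broken = trace_open_elements("<a><b></a>")
```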
This specification provides the following additional information that user agents should use when retrieving an external entity: the public identifiers given in the following list all correspond to the URL given by this link.
-//W3C//DTD XHTML 1.0 Transitional//EN
-//W3C//DTD XHTML 1.1//EN
-//W3C//DTD XHTML 1.0 Strict//EN
-//W3C//DTD XHTML 1.0 Frameset//EN
-//W3C//DTD XHTML Basic 1.0//EN
-//W3C//DTD XHTML 1.1 plus MathML 2.0//EN
-//W3C//DTD XHTML 1.1 plus MathML 2.0 plus SVG 1.1//EN
-//W3C//DTD MathML 2.0//EN
-//WAPFORUM//DTD XHTML Mobile 1.0//EN
Furthermore, user agents should attempt to retrieve the above external entity's content when one of the above public identifiers is used, and should not attempt to retrieve any other external entity's content.
This is not strictly a violation of the XML specification, but it does contradict the spirit of the XML specification's requirements. This is motivated by a desire for user agents to all handle entities in an interoperable fashion without requiring any network access for handling external subsets. [XML]
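The retrieval rule above might be sketched as a simple lookup. This is a non-normative illustration: the function name `resolve_external_entity()` is hypothetical, and `SHARED_ENTITY_URL` is a placeholder, since the actual URL is not reproduced in this text.

```python
# Public identifiers listed above; all of them correspond to one shared URL.
KNOWN_PUBLIC_IDENTIFIERS = frozenset({
    "-//W3C//DTD XHTML 1.0 Transitional//EN",
    "-//W3C//DTD XHTML 1.1//EN",
    "-//W3C//DTD XHTML 1.0 Strict//EN",
    "-//W3C//DTD XHTML 1.0 Frameset//EN",
    "-//W3C//DTD XHTML Basic 1.0//EN",
    "-//W3C//DTD XHTML 1.1 plus MathML 2.0//EN",
    "-//W3C//DTD XHTML 1.1 plus MathML 2.0 plus SVG 1.1//EN",
    "-//W3C//DTD MathML 2.0//EN",
    "-//WAPFORUM//DTD XHTML Mobile 1.0//EN",
})

# Placeholder only (assumption): stands in for the URL the specification links to.
SHARED_ENTITY_URL = "about:hypothetical-xhtml-entities"


def resolve_external_entity(public_id):
    """Return the shared URL for a listed identifier, or None to skip retrieval."""
    if public_id in KNOWN_PUBLIC_IDENTIFIERS:
        return SHARED_ENTITY_URL
    return None  # any other external entity is not retrieved
```

Because every listed identifier resolves to the same resource, a user agent following this pattern could ship that resource locally and need no network access for external subsets.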
When an XML parser creates a
script element, it must be marked as being
"parser-inserted". If the parser was originally
created for the XML fragment parsing algorithm, then
the element must be marked as "already started"
also. When the element's end tag is parsed, the user agent must
element. If this causes there to be a pending parsing-blocking
script, then the user agent must run the following steps:
There is no longer a pending parsing-blocking script.
Certain algorithms in this specification spoon-feed the parser characters one string at a time. In such cases, the XML parser must act as it would have if faced with a single string consisting of the concatenation of all those characters.
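The equivalence between incremental feeding and a single concatenated string can be demonstrated with an incremental parser. The sketch below is non-normative and assumes Python's standard `xml.etree.ElementTree` module; `parse_in_chunks()` and the sample chunks are hypothetical.

```python
import xml.etree.ElementTree as ET


def parse_in_chunks(chunks):
    """Feed the parser one string at a time, then finish parsing."""
    parser = ET.XMLParser()
    for chunk in chunks:
        parser.feed(chunk)
    return parser.close()


# The chunk boundaries deliberately fall in the middle of tags and text.
chunks = ["<doc><ti", "tle>Hel", "lo</title></d", "oc>"]

fed_root = parse_in_chunks(chunks)
whole_root = ET.fromstring("".join(chunks))

# Both approaches are expected to yield the same tree.
same_result = ET.tostring(fed_root) == ET.tostring(whole_root)
```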
When an XML parser reaches the end of its input, it must stop parsing, following the same rules as the HTML parser. An XML parser can also be aborted, which must again be done in the same way as for an HTML parser.
In both cases, the string returned must be XML
namespace-well-formed and must be an isomorphic serialization of all
of that node's child nodes, in tree order. User agents
may adjust prefixes and namespace declarations in the serialization
(and indeed might be forced to do so in some cases to obtain
namespace-well-formed XML). User agents may use a combination of
regular text, character references, and CDATA sections to represent
text nodes in the DOM (and indeed
might be forced to use representations that don't match the DOM's,
e.g. if a
CDATASection node contains the string "
Elements, if any of the elements in the
serialization are in no namespace, the default namespace in scope
for those elements must be explicitly declared as the empty
string. (This doesn't
apply in the
Document case.) [XML] [XMLNS]
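To illustrate why a serializer is free to mix regular text, character references, and CDATA sections, the non-normative sketch below parses several different serializations and checks that they all produce the same text content. It assumes Python's standard `xml.etree.ElementTree` module; the sample strings are hypothetical.

```python
import xml.etree.ElementTree as ET

# Four different serializations of a paragraph containing "a & b".
variants = [
    "<p>a &amp; b</p>",            # predefined entity reference
    "<p>a &#38; b</p>",            # numeric character reference
    "<p><![CDATA[a & b]]></p>",    # a single CDATA section
    "<p>a <![CDATA[&]]> b</p>",    # regular text mixed with a CDATA section
]

# Each variant parses back to the same character data.
texts = [ET.fromstring(variant).text for variant in variants]
all_equivalent = all(text == "a & b" for text in texts)
```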
For the purposes of this section, an internal general parsed entity is considered XML namespace-well-formed if a document consisting of an element with no namespace declarations whose contents are the internal general parsed entity would itself be XML namespace-well-formed.
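The definition above translates into a direct test: wrap the entity's replacement text in an element that carries no namespace declarations, then parse the result with a namespace-aware parser. The sketch below is non-normative and assumes that the parser behind Python's `xml.etree.ElementTree` is an adequate approximation of a namespace-well-formedness check; the function name and the `wrapper` element name are hypothetical.

```python
import xml.etree.ElementTree as ET


def entity_is_namespace_well_formed(replacement_text):
    """Check the text by wrapping it in an element with no namespace declarations."""
    wrapped = "<wrapper>" + replacement_text + "</wrapper>"
    try:
        ET.fromstring(wrapped)
    except ET.ParseError:
        return False
    return True


plain_markup = entity_is_namespace_well_formed("text with <b>markup</b>")
unbound_prefix = entity_is_namespace_well_formed("<svg:g/>")
self_declared = entity_is_namespace_well_formed(
    '<svg:g xmlns:svg="http://www.w3.org/2000/svg"/>'
)
unclosed = entity_is_namespace_well_formed("<b>never closed")
```

A prefix that the replacement text uses but does not itself declare fails the check, because the wrapper element contributes no declarations.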
If any of the following error cases are found in the DOM subtree
being serialized, then the algorithm must raise an
INVALID_STATE_ERR exception instead of returning a string:
Document node with no child element nodes.
DocumentType node that has an external subset public identifier that contains characters that are not matched by the XML
DocumentType node that has an external subset system identifier that contains both a U+0022 QUOTATION MARK (") and a U+0027 APOSTROPHE (') or that contains characters that are not matched by the XML
Attr node with no namespace whose local name is the lowercase string "
Element node with two or more attributes with the same local name and namespace.
ProcessingInstruction node whose data contains characters that are not matched by the XML
Comment node whose data contains two adjacent U+002D HYPHEN-MINUS characters (-) or ends with such a character.
ProcessingInstruction node whose target name is an ASCII case-insensitive match for the string "
ProcessingInstruction node whose target name contains a U+003A COLON (:).
ProcessingInstruction node whose data contains the string "
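A pre-serialization check for some of these error cases might look like the following non-normative sketch. It assumes Python's standard `xml.dom.minidom` module and covers only three of the cases listed above (a Document with no child element, a Comment containing adjacent hyphens or ending with a hyphen, and a ProcessingInstruction target containing a colon); `find_serialization_errors()` is a hypothetical name, and a real implementation would raise an exception rather than return a list.

```python
from xml.dom import Node, minidom


def find_serialization_errors(node):
    """Collect descriptions of a subset of the error cases, in tree order."""
    errors = []
    if node.nodeType == Node.DOCUMENT_NODE:
        has_element = any(
            child.nodeType == Node.ELEMENT_NODE for child in node.childNodes
        )
        if not has_element:
            errors.append("Document node with no child element nodes")
    elif node.nodeType == Node.COMMENT_NODE:
        if "--" in node.data or node.data.endswith("-"):
            errors.append("Comment node with adjacent or trailing hyphen")
    elif node.nodeType == Node.PROCESSING_INSTRUCTION_NODE:
        if ":" in node.target:
            errors.append("ProcessingInstruction target contains a colon")
    for child in node.childNodes:
        errors.extend(find_serialization_errors(child))
    return errors


empty_document = minidom.Document()

clean_document = minidom.parseString("<root><!-- fine --></root>")

# The DOM API allows these nodes to be created even though they
# cannot be serialized as XML.
bad_document = minidom.parseString("<root/>")
bad_root = bad_document.documentElement
bad_root.appendChild(bad_document.createComment("a--b"))
bad_root.appendChild(bad_document.createProcessingInstruction("bad:target", "x"))
```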
These are the only ways to make a DOM
unserializable. The DOM enforces all the other XML constraints; for
example, trying to append two elements to a
node will raise a
Create a new XML parser.
If there is a context element, feed the parser just created the string corresponding to the start tag of that element, declaring all the namespace prefixes that are in scope on that element in the DOM, as well as declaring the default namespace (if any) that is in scope on that element in the DOM.
A namespace prefix is in scope if the DOM Core
lookupNamespaceURI() method on the element would
return a non-null value for that prefix.
The default namespace is the namespace for which the DOM Core
isDefaultNamespace() method on the element
would return true.
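The start tag fed to the parser for a context element could be built as in the non-normative sketch below. It assumes Python's standard `xml.dom.minidom` module and computes the in-scope declarations by walking the element's ancestors by hand, as an approximation of what lookupNamespaceURI() and isDefaultNamespace() would report; `in_scope_namespaces()` and `context_start_tag()` are hypothetical names.

```python
from xml.dom import Node, minidom
from xml.sax.saxutils import quoteattr


def in_scope_namespaces(element):
    """Map prefix to URI for declarations in scope; '' is the default namespace."""
    scope = {}
    node = element
    while node is not None and node.nodeType == Node.ELEMENT_NODE:
        for name, value in node.attributes.items():
            if name == "xmlns":
                scope.setdefault("", value)  # the nearest declaration wins
            elif name.startswith("xmlns:"):
                scope.setdefault(name[len("xmlns:"):], value)
        node = node.parentNode
    return scope


def context_start_tag(element):
    """Build a start tag that redeclares every namespace in scope on the element."""
    parts = [element.tagName]
    for prefix, uri in sorted(in_scope_namespaces(element).items()):
        if prefix == "" and uri == "":
            continue  # xmlns="" means there is no default namespace in scope
        attribute = "xmlns:" + prefix if prefix else "xmlns"
        parts.append(attribute + "=" + quoteattr(uri))
    return "<" + " ".join(parts) + ">"


document = minidom.parseString(
    '<html xmlns="http://www.w3.org/1999/xhtml"'
    ' xmlns:svg="http://www.w3.org/2000/svg"><body><p/></body></html>'
)
paragraph = document.getElementsByTagName("p")[0]
start_tag = context_start_tag(paragraph)
```

The declarations on the `html` ancestor are copied onto the generated start tag, so prefixes used in the fragment input resolve as they would in the original document.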
If there is a context element, no
DOCTYPE is passed to the parser, and
therefore no external subset is referenced, and therefore no
entities will be recognized.
Feed the parser just created the string input.
If there is a context element, feed the parser just created the string corresponding to the end tag of that element.
If there is an XML well-formedness or XML namespace
well-formedness error, then raise an
exception and abort these steps.
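The steps above can be sketched, non-normatively, for the case where a context element is present. The sketch assumes Python's standard `xml.etree.ElementTree` module as the XML parser; `parse_xml_fragment()` is a hypothetical name, and `FragmentSyntaxError` is a hypothetical stand-in for the exception the algorithm raises.

```python
import xml.etree.ElementTree as ET


class FragmentSyntaxError(Exception):
    """Hypothetical stand-in for the exception raised on a well-formedness error."""


def parse_xml_fragment(fragment, context_start_tag, context_end_tag):
    """Parse fragment as the content of a context element; return that element."""
    parser = ET.XMLParser()                # create a new XML parser
    try:
        parser.feed(context_start_tag)     # start tag of the context element
        parser.feed(fragment)              # the string input
        parser.feed(context_end_tag)       # end tag of the context element
        return parser.close()
    except ET.ParseError as error:
        # Well-formedness or namespace well-formedness error: abort.
        raise FragmentSyntaxError(str(error)) from error


XHTML_NS = "http://www.w3.org/1999/xhtml"
root = parse_xml_fragment("<b>x</b><i/>", '<p xmlns="' + XHTML_NS + '">', "</p>")
child_tags = [child.tag for child in root]


def fails(fragment):
    """Report whether parsing the fragment inside a bare <p> raises an error."""
    try:
        parse_xml_fragment(fragment, "<p>", "</p>")
    except FragmentSyntaxError:
        return True
    return False
```

Because no DOCTYPE is fed to the parser, a named entity reference such as `&nbsp;` in the input is rejected in this sketch, while numeric character references remain usable.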