This section describes the status of this document at the time of its publication. Other documents may supersede this document. A list of current W3C publications and the latest revision of this technical report can be found in the W3C technical reports index at http://www.w3.org/TR/.
This document summarizes design guidelines for authors who wish their XHTML or HTML documents to validate on either HTML or XML parsers, assuming the parsers to be HTML5-compliant. This specification is intended to be used by web authors. It is not a specification for user agents and creates no obligations on user agents. Note that this recommendation does not define how HTML5-conforming user agents should process HTML documents. Nor does it define the meaning of the Internet Media Type text/html. For user agent guidance and for these definitions, see [HTML5] and [RFC2854].
This document was published by the HTML working group as a Working Draft. This document is intended to become a W3C Recommendation. Please submit comments regarding this document by using the W3C's public bug database (http://www.w3.org/Bugs/Public/) with the product set to HTML WG and the component set to HTML/XHTML Compatibility Authoring Guide (ed: Eliot Graff). If you cannot access the bug database, submit comments to the working group's public comments mailing list, and arrangements will be made to transpose the comments to the bug database. All feedback is welcome.
Publication as a Working Draft does not imply endorsement by the W3C Membership. This is a draft document and may be updated, replaced or obsoleted by other documents at any time. It is inappropriate to cite this document as other than work in progress.
This document was produced by a group operating under the 5 February 2004 W3C Patent Policy. W3C maintains a public list of any patent disclosures made in connection with the deliverables of the group; that page also includes instructions for disclosing a patent. An individual who has actual knowledge of a patent which the individual believes contains Essential Claim(s) must disclose the information in accordance with section 6 of the W3C Patent Policy.
This section is non-normative.
It is often valuable to be able to serve HTML5 documents that are also valid XML documents. An author may, for example, use XML tools to generate a document, and they and others may process the document using XML tools. These documents are served as text/html. The language used to create documents that can be parsed by both HTML and XML parsers is called polyglot markup. Polyglot markup is the overlap language of documents which are both HTML5 documents and XML documents.
Processing Instructions and the XML Declaration are both forbidden in polyglot markup.
Polyglot markup uses either UTF-8 or UTF-16. UTF-8 is preferred. When polyglot markup uses UTF-16, it must include the BOM indicating little-endian UTF-16 or big-endian UTF-16.
Polyglot markup declares character encoding in one of two ways: by using a Byte Order Mark (BOM), or in the HTTP header of the response, as in the following:
Content-type: text/html; charset=utf-8
Content-type: text/html; charset=utf-16
Note that polyglot markup may also use application/xhtml+xml for the value of the content type.
The <meta charset="*"/> element has no effect in XML. Therefore,
polyglot markup may use
<meta charset="*"/> in combination with a BOM, as long as the meta element
specifies the same character encoding as the BOM. In addition, the meta
element may be used in the absence of a
BOM as long as it matches the already specified encoding. Note that the
W3C Internationalization (i18n) Group recommends
always including a visible encoding declaration in a document, because
the declaration helps developers, testers, or translation production
managers to check the encoding of a document visually.
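A minimal sketch of a visible encoding declaration, assuming a document saved as UTF-8 whose BOM or HTTP header also indicates UTF-8. The title and paragraph text are placeholders, not part of this guide.

```html
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en">
  <head>
    <!-- Ignored by XML parsers; must agree with the BOM or HTTP charset. -->
    <meta charset="utf-8"/>
    <title>Encoding declaration</title>
  </head>
  <body>
    <p>This document is encoded as UTF-8.</p>
  </body>
</html>
```

The meta element here only repeats an encoding that is already established elsewhere, so both parsers arrive at the same result.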
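In the document type declaration of polyglot markup:
The string DOCTYPE is in uppercase letters.
The string html is in lowercase letters.
The string SYSTEM, if present, is in uppercase letters.
The string PUBLIC, if present, is in uppercase letters.
If the URI is about:legacy-compat, the string must be in lowercase, as required by HTML5.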
Note that polyglot markup cannot use document type declarations for HTML4, HTML3, or HTML2, regardless of whether they contain a URI or not and regardless of their effect in HTML5 parsers, as these document type declarations are not compatible with XHTML.
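As an illustration of the case rules above, the two declarations below are alternatives (a document contains only one of them). Both are sketches of doctypes that HTML5 accepts and that are also well-formed in XML.

```html
<!-- Short form -->
<!DOCTYPE html>

<!-- Form with a system identifier, for XML tools that cannot emit the short form -->
<!DOCTYPE html SYSTEM "about:legacy-compat">
```

In both cases DOCTYPE and SYSTEM are uppercase, while html and about:legacy-compat are lowercase.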
The following rules apply to namespaces used in polyglot markup.
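[HTML5] introduces undeclared (native) default namespaces for the root HTML
element <html>, the root SVG element <svg>,
and the root MathML element
<math>. The following
default namespaces must be
declared in polyglot markup, to maintain XML-compatibility [XML10]:
<html xmlns="http://www.w3.org/1999/xhtml">
<svg xmlns="http://www.w3.org/2000/svg">
<math xmlns="http://www.w3.org/1998/Math/MathML">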
Polyglot markup must not declare any other default or prefixed element namespace, because [HTML5] does not natively support the declaring of any other default or prefixed element namespace.
[HTML5] introduces undeclared (native) support for attributes in the XLink
namespace and with the prefix
xlink:. Polyglot markup
must declare the XLink namespace on the HTML root element (<html>)
or once on the foreign element where it is used (<svg> or
<math>), to maintain XML-compatibility [XML10].
In polyglot markup, the xlink prefix is declared by using the namespace declaration
xmlns:xlink="http://www.w3.org/1999/xlink"
before using the xlink prefix for the following elements:
<html> element or any other HTML element.
Note that there are other prefixed attributes that can be used
besides xlink:href (such as xml:lang).
Polyglot markup does not declare these prefixes via xmlns. The
prefixes are implicitly declared in XML and are automatically
applied to the appropriate attributes in HTML.
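The namespace rules above can be sketched in one small document. This is an illustrative example only: the link target, dimensions, and text are invented, and the XLink namespace is declared on the foreign element where it is used.

```html
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en">
  <head>
    <title>Namespace declarations</title>
  </head>
  <body>
    <!-- SVG default namespace plus the xlink prefix, declared before use -->
    <svg xmlns="http://www.w3.org/2000/svg"
         xmlns:xlink="http://www.w3.org/1999/xlink"
         width="120" height="40">
      <a xlink:href="#target"><text x="10" y="25">Link</text></a>
    </svg>
    <!-- MathML default namespace -->
    <math xmlns="http://www.w3.org/1998/Math/MathML">
      <mi>x</mi>
    </math>
    <p id="target">Target paragraph.</p>
  </body>
</html>
```

Note that xml:lang is used without any xmlns:xml declaration, because the xml prefix is implicitly bound in XML.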
Polyglot markup conforms to the following rules regarding elements.
Polyglot markup must
explicitly have a
tbody element surrounding groups of
tr elements within a
table element. HTML parsers insert the
tbody element, but XML parsers do not, thus
creating different DOMs.
Correct:
<table> <tbody> <tr>...
Incorrect:
<table> <tr>...
Polyglot markup must
explicitly have a
colgroup element surrounding groups of
col elements within a
table element. HTML
parsers insert the
colgroup element, but XML parsers do
not, thus creating different DOMs.
Correct:
<table> <colgroup> <col>...
Incorrect:
<table> <col>...
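A fuller sketch of a table that satisfies both rules above, with explicit colgroup and tbody elements. The cell contents are placeholders.

```html
<table>
  <colgroup>
    <col/>
    <col/>
  </colgroup>
  <tbody>
    <tr>
      <td>Name</td>
      <td>Value</td>
    </tr>
  </tbody>
</table>
```

Because every element an HTML parser would otherwise insert is already present, an XML parser and an HTML parser should build the same element tree from this markup.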
Polyglot markup does not use the <noscript> element, because the
<noscript> element must not be used in XML documents.
The following guidelines apply to any usage of element names, attribute names, or attribute values in markup, script, or CSS. Polyglot markup uses lower case letters for all ASCII letters. For non-ASCII letters—such as Greek, Cyrillic, or non-ASCII Latin letters—polyglot markup respects case sensitivity as it is called for.
Polyglot markup uses the correct case for element names.
Polyglot markup uses the correct case for attribute names.
For example, the lowercase MathML attribute name definitionurl must be changed to the mixed case definitionURL.
Polyglot markup uses lowercase letters for the values of the attributes in the following list when they exist on HTML elements. More specifically, where required, polyglot markup must use lower case letters for all ASCII letters in these attribute values; however, polyglot markup respects case sensitivity for non-ASCII letters such as Greek, Cyrillic, or non-ASCII Latin letters. For attribute values on HTML elements other than those in the following list, polyglot markup may use mixed case letters.
Because XML is case sensitive, polyglot markup also requires case to be consistent for values between markup, DOM APIs, and CSS. In addition, polyglot markup respects the case sensitivity of all other attribute values. Although polyglot markup must always have lowercase values of the attributes in the following list when they exist on HTML elements, attributes not in this list and attributes on non-HTML elements may have values made of mixed case letters. Note that other specifications, such as RDFa, may place additional restrictions on the allowed values of certain attributes.
Polyglot markup uses only the elements in the following list as empty elements.
Polyglot markup uses the minimized tag syntax for empty elements,
for example <br/>. The alternative syntax <br></br>
allowed by XML gives uncertain results in many existing user agents.
Given an empty instance of an element whose content model is not
EMPTY (for example, an empty title or paragraph), polyglot markup
does not use the minimized form (e.g. the document uses
<p></p> and not <p/>).
Note that MathML and SVG elements may be either self-closing or contain content.
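The fragment below sketches both cases: elements that are always empty use the minimized form, and elements that merely happen to be empty keep separate start and end tags. The file names are hypothetical.

```html
<!-- Always-empty elements: minimized tag syntax -->
<p>First line<br/>second line</p>
<img src="logo.png" alt="Logo"/>

<!-- Elements that can have content but are empty here: explicit end tag -->
<p></p>
<div></div>
<script src="external.js"></script>
```

An HTML parser treats <div/> as a start tag and ignores the slash, so the minimized form would swallow the following content in HTML while closing immediately in XML.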
The following elements or their considerations require exceptions to the general rules for polyglot markup.
The text within a <pre> element must not begin with white space.
Because of attribute-value normalization in XML [XML10], polyglot markup does not contain tabs, line feeds, or carriage returns within CDATA attributes.
Polyglot markup surrounds all attribute values with quotation marks. Attribute values may be surrounded either by single quotation marks or by double quotation marks.
See also Attribute Values.
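As a brief illustration of the attribute rules above, every value below is quoted, boolean attributes carry an explicit value, and no value contains a tab or line break. The names and file names are invented for the example.

```html
<input type="checkbox" name="agree" checked="checked"/>
<a href='notes.html' title="Release notes">Notes</a>
```

Unquoted values (type=checkbox) and minimized attributes (checked alone) are accepted by HTML parsers but are not well-formed XML.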
The following attributes are not allowed in polyglot markup. These attributes have effects in documents parsed as XML but do not have effects in documents parsed as text/html. The HTML5 spec therefore defines them as invalid in text/html documents. [HTML5]
Note that these attributes are allowed on SVG and MathML elements.
When using language attributes, polyglot markup
must use both the lang and
xml:lang attributes. Neither
attribute is to be used without the other, and the values for both
must be the same.
Polyglot markup should use
the language attributes in the
html element to set the
default language for the document.
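A short sketch of paired language attributes, assuming an English document that contains one French paragraph. The text content is illustrative.

```html
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en">
  <head>
    <title>Language attributes</title>
  </head>
  <body>
    <p>Hello, world.</p>
    <!-- Both attributes present, with identical values -->
    <p lang="fr" xml:lang="fr">Bonjour le monde.</p>
  </body>
</html>
```

The pair on the html element sets the document default, and the pair on the second paragraph overrides it locally.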
Polyglot markup uses only the following named entity references:
For entities beyond the previous list, a polyglot document uses
character references. For example, polyglot markup uses
&#160; instead of &nbsp;. Note that polyglot markup may use
decimal values for escape characters (such as &#160; in the previous
example); however, the
Character Model for the World Wide Web recommends that content
should use the hexadecimal form of character escapes rather than
the decimal form when both are available. [CHARMOD]
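For illustration, the fragment below uses only named references that XML predefines without a DTD (amp, lt, gt, apos, and quot) and writes the no-break space as a numeric character reference. The sentence itself is a made-up example.

```html
<!-- Named references that XML predefines, plus a hexadecimal character reference -->
<p>Fish &amp; chips cost &lt; 10&#xA0;euros.</p>

<!-- The same no-break space in decimal form -->
<p>Fish &amp; chips cost &lt; 10&#160;euros.</p>
```

A reference such as &nbsp; works in an HTML parser but is a fatal error in an XML parser that has not read a DTD defining it.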
Script and style commands should be included by linking to external files rather than including them in-line. However, polyglot markup must not link to an external stylesheet by using the xml-stylesheet processing instruction. See also Processing Instructions and the XML Declaration.
The following examples show the proper way to include external script and style, respectively:
<script src="external.js"></script>
<link rel="stylesheet" href="external.css"/>
Although document.write() and document.writeln()
are valid in an HTML document, neither function may be used in XHTML.
Therefore, neither is used in polyglot markup. Instead, use the
innerHTML property for both HTML and XHTML. Note that the
innerHTML property takes a string. XML parsers parse the string
as XML in XHTML. HTML parsers parse the string as HTML in HTML. Because
of the difference in parsing, if you send the parser content that does
not follow the rules for polyglot markup, the results will differ between a
DOM created with an XML parser and one created with an HTML parser.
Polyglot markup uses external scripts or style sheets if that document's script or
style sheet uses <, &, ]]>, or
--. Note that XML parsers are permitted to silently
remove the contents of comments; therefore, the historical practice
of hiding scripts and style sheets within comments to make the
documents backward compatible is unlikely to work as expected in
XML-based user agents.
If polyglot markup must use script or style commands within its
source code, either use safe content or wrap the command in a
CDATA section. However, polyglot markup does not use a
CDATA section unless it is being used within foreign content.
Safe content is content that does not contain a < or
& character. The following example is safe
because it does not contain problematic characters within
the <script> element:
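A minimal sketch of safe in-line content, using invented script and style rules that contain neither a < nor an & character:

```html
<script>
  var greeting = "Hello, polyglot";
  document.title = greeting;
</script>
<style>
  p { color: navy; }
</style>
```

Adding a comparison such as a < b or a condition joined with && would make the script unsafe, at which point it belongs in an external file.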
Note that you cannot achieve the same DOM in both XHTML and HTML
by using in-line commands in a CDATA section. However, this
is not usually a problem unless the code has a dependency on
the exact number of text nodes under a <script> or
<style> element. The following examples show
in-line script and style commands wrapped in a
CDATA section:
<script> //<![CDATA[ (script goes here) //]]> </script>
<style> /*<![CDATA[*/ (styles go here) /*]]>*/ </style>
When using MathML or SVG, the parser follows the XML parsing rules. Polyglot markup does not rely on getting a CDATA instance from the DOM when using MathML or SVG, because the HTML parser does not create a CDATA instance in the DOM.
Many thanks to Daniel Glazman, Richard Ishida, Tony Ross, Sam Ruby, Jonas Sicking, Leif Halvard Silli, Henri Sivonen, Manu Sporny, and Philip Taylor. Special thanks to the W3C TAG and the W3C Internationalization (i18n) Core Working Group.
No informative references.