Copyright © 2010 W3C® (MIT, ERCIM, Keio), All Rights Reserved. W3C liability, trademark and document use rules apply.
A document that uses polyglot markup is an HTML5 document that is at the same time an XML document and an HTML document, and that meets a well-defined set of constraints. Polyglot markup that meets these constraints is interpreted as compatible, regardless of whether it is processed as HTML or as XHTML, per the HTML5 specification. Polyglot markup uses a specific doctype, namespace declarations, and a specific case (normally lower case but occasionally camel case) for element and attribute names. Polyglot markup uses lower case for certain attribute values. Further constraints include those on empty elements, named entity references, and the use of scripts and style.
This section describes the status of this document at the time of its publication. Other documents may supersede this document. A list of current W3C publications and the latest revision of this technical report can be found in the W3C technical reports index at http://www.w3.org/TR/.
This document summarizes design guidelines for authors who wish their XHTML or HTML documents to validate on either HTML or XML parsers, assuming the parsers to be HTML5-compliant. This specification is intended to be used by web authors. It is not a specification for user agents and creates no obligations on user agents. Note that this recommendation does not define how HTML5-conforming user agents should process HTML documents. Nor does it define the meaning of the Internet Media Type text/html. For user agent guidance and for these definitions, see [HTML5] and [RFC2854].
This document was published by the W3C HTML Working Group as a First Public Working Draft. This document is intended to become a W3C Recommendation. If you wish to make comments regarding this document, please send them to public-html@w3.org. All feedback is welcome.
Publication as a Working Draft does not imply endorsement by the W3C Membership. This is a draft document and may be updated, replaced or obsoleted by other documents at any time. It is inappropriate to cite this document as other than work in progress.
This document was produced by a group operating under the 5 February 2004 W3C Patent Policy. This document is informative only. W3C maintains a public list of any patent disclosures made in connection with the deliverables of the group; that page also includes instructions for disclosing a patent. An individual who has actual knowledge of a patent which the individual believes contains Essential Claim(s) must disclose the information in accordance with section 6 of the W3C Patent Policy.
This section is non-normative.
It is often valuable to be able to serve HTML5 documents that are also valid XML documents. An author may, for example, use XML tools to generate a document, and they and others may process the document using XML tools. These documents are served as text/html. The language used to create documents that can be parsed by both HTML and XML parsers is called polyglot markup. Polyglot markup is the overlap language of documents which are both HTML5 documents and XML documents.
Polyglot markup does not use processing instructions. Note that the XML declaration is not a processing instruction; its parsing rules are defined separately in Prolog and Document Type Declaration.
Polyglot markup uses either UTF-8 or UTF-16, although generally UTF-8 is preferred. When polyglot markup uses UTF-16, it should include the BOM indicating UTF-16LE or UTF-16BE. In that case, polyglot markup need not include the meta charset declaration, because a parser must already be reading the document as UTF-16 in order to parse the declaration at all.
In short, for correct character encoding, polyglot markup must either:
meta tag to specify the appropriate character encoding.
If polyglot markup uses an encoding other than UTF-8 or UTF-16, it must include the XML declaration; however, in this case the document must also include the HTML meta tag specifying the character set.
When polyglot markup uses both the XML declaration and the HTML meta tag, these must specify the same character encoding.
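This example is non-normative. The rule that the XML declaration and the meta tag must name the same encoding can be checked mechanically. The following is a minimal sketch in Python, assuming the document is available as a string; the function names are hypothetical, the regular expressions are a simplification and not a full parser, and encoding labels are compared without regard to case.

```python
import re

def declared_encodings(markup: str) -> dict:
    """Return the encodings named by the XML declaration and the meta tag, if any."""
    # The XML declaration, when present, must be at the very start of the document.
    xml_decl = re.search(r'^<\?xml[^>]*\bencoding=["\']([^"\']+)["\']', markup)
    # Covers both <meta charset="..."/> and the http-equiv form with charset=... in content.
    meta = re.search(r'<meta\s[^>]*\bcharset=["\']?([A-Za-z0-9._:-]+)', markup, re.IGNORECASE)
    return {
        "xml": xml_decl.group(1).lower() if xml_decl else None,
        "meta": meta.group(1).lower() if meta else None,
    }

def encodings_agree(markup: str) -> bool:
    """True unless both declarations are present and name different encodings."""
    found = declared_encodings(markup)
    if found["xml"] is None or found["meta"] is None:
        return True  # only one declaration (or none), so nothing can conflict
    return found["xml"] == found["meta"]

document = (
    '<?xml version="1.0" encoding="ISO-8859-1"?>\n'
    '<!DOCTYPE html>\n'
    '<html xmlns="http://www.w3.org/1999/xhtml"><head>'
    '<meta charset="iso-8859-1"/><title>Example</title></head><body></body></html>'
)
print(encodings_agree(document))
```

A document whose meta tag named a different encoding than its XML declaration would make `encodings_agree` return `False`.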
Polyglot markup uses the <!DOCTYPE html> doctype. Note that for polyglot markup the string html must be lower case. For a pure HTML document, the string is defined as case-insensitive. [HTML5]
The following rules apply to namespaces used in polyglot markup.
The <html> element uses the namespace declaration xmlns="http://www.w3.org/1999/xhtml".
Each <math> element uses the namespace declaration xmlns="http://www.w3.org/1998/Math/MathML".
Each <svg> element uses the namespace declaration xmlns="http://www.w3.org/2000/svg".
Polyglot markup declares the xlink prefix with xmlns:xlink="http://www.w3.org/1999/xlink" before using xlink:href. The prefix can be defined either:
on the <html> element; or
on the <svg> element that contains one or more elements with xlink:href attributes.
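This example is non-normative. An XML parser assigns namespaces only from the declarations it sees, so a missing xmlns attribute leaves an element in the wrong namespace. The following minimal sketch uses Python's xml.etree.ElementTree to report html, math, and svg elements that an XML parser would place in an unexpected namespace; the function name and the sample documents are hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace each element is expected to be in when parsed as XML.
EXPECTED_NAMESPACES = {
    "html": "http://www.w3.org/1999/xhtml",
    "math": "http://www.w3.org/1998/Math/MathML",
    "svg": "http://www.w3.org/2000/svg",
}

def misplaced_elements(markup: str) -> list:
    """List html, math, and svg elements that end up in the wrong namespace."""
    problems = []
    for element in ET.fromstring(markup).iter():
        # ElementTree spells qualified names as "{namespace}local-name".
        namespace, _, local_name = element.tag.rpartition("}")
        namespace = namespace.lstrip("{")
        expected = EXPECTED_NAMESPACES.get(local_name)
        if expected is not None and namespace != expected:
            problems.append(local_name)
    return problems

with_declaration = (
    '<html xmlns="http://www.w3.org/1999/xhtml"><head><title>t</title></head>'
    '<body><svg xmlns="http://www.w3.org/2000/svg"/></body></html>'
)
without_declaration = (
    '<html xmlns="http://www.w3.org/1999/xhtml"><head><title>t</title></head>'
    '<body><svg/></body></html>'
)
print(misplaced_elements(with_declaration), misplaced_elements(without_declaration))
```

In the second sample the svg element inherits the XHTML default namespace, which is why it is reported.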
Each document using polyglot markup must have a root html element. The root html element must contain both a head and a body element. The head element must contain a title element.
Polyglot markup must explicitly have a tbody element surrounding groups of tr elements within a table element. HTML parsers insert the tbody element, but XML parsers do not, thus creating different DOMs.
Correct:
<table> <tbody> <tr>...
Incorrect:
<table> <tr>...
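This example is non-normative. Because an XML parser never inserts a tbody element, rows written directly under a table remain direct children of that table in the XML DOM. The following minimal sketch, assuming XHTML-namespaced input and using Python's xml.etree.ElementTree, counts such rows; the function name is hypothetical.

```python
import xml.etree.ElementTree as ET

XHTML = "{http://www.w3.org/1999/xhtml}"

def rows_outside_tbody(markup: str) -> int:
    """Count tr elements that are direct children of a table element."""
    root = ET.fromstring(markup)
    return sum(
        len(table.findall(XHTML + "tr"))  # findall matches direct children only
        for table in root.iter(XHTML + "table")
    )

NS = 'xmlns="http://www.w3.org/1999/xhtml"'
correct = f"<table {NS}><tbody><tr><td>a</td></tr></tbody></table>"
incorrect = f"<table {NS}><tr><td>a</td></tr></table>"
print(rows_outside_tbody(correct), rows_outside_tbody(incorrect))
```

A result other than zero indicates a table whose HTML DOM and XML DOM would differ.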
The following guidelines apply to any usage of element names, attribute names, or attribute values in markup, script, or CSS. When required, polyglot markup uses lower case letters for all ASCII letters; however, case requirements do not apply to non-ASCII letters such as Greek, Cyrillic, or non-ASCII Latin letters.
Polyglot markup uses the correct case for element names.
altGlyph
altGlyphDef
altGlyphItem
animateColor
animateMotion
animateTransform
clipPath
feBlend
feColorMatrix
feComponentTransfer
feComposite
feConvolveMatrix
feDiffuseLighting
feDisplacementMap
feDistantLight
feFlood
feFuncA
feFuncB
feFuncG
feFuncR
feGaussianBlur
feImage
feMerge
feMergeNode
feMorphology
feOffset
fePointLight
feSpecularLighting
feSpotLight
feTile
feTurbulence
foreignObject
glyphRef
linearGradient
radialGradient
textPath
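This example is non-normative. A tool that generates polyglot markup can restore the mixed-case spelling from a name written in any case. The following minimal sketch assumes that element names outside the list above use lower case for ASCII letters; the function name is hypothetical.

```python
import string

# The mixed-case element names listed above.
MIXED_CASE_ELEMENTS = """
altGlyph altGlyphDef altGlyphItem animateColor animateMotion animateTransform
clipPath feBlend feColorMatrix feComponentTransfer feComposite feConvolveMatrix
feDiffuseLighting feDisplacementMap feDistantLight feFlood feFuncA feFuncB
feFuncG feFuncR feGaussianBlur feImage feMerge feMergeNode feMorphology
feOffset fePointLight feSpecularLighting feSpotLight feTile feTurbulence
foreignObject glyphRef linearGradient radialGradient textPath
""".split()

# Lowercases ASCII letters only; non-ASCII letters are left untouched.
_ASCII_LOWER = str.maketrans(string.ascii_uppercase, string.ascii_lowercase)

_BY_LOWERCASE = {name.translate(_ASCII_LOWER): name for name in MIXED_CASE_ELEMENTS}

def polyglot_element_name(name: str) -> str:
    """Return the spelling of an element name to use in polyglot markup."""
    key = name.translate(_ASCII_LOWER)
    return _BY_LOWERCASE.get(key, key)

print(polyglot_element_name("lineargradient"), polyglot_element_name("CIRCLE"))
```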
Polyglot markup uses the correct case for attribute names.
The lowercase definitionurl must be changed to the mixed case definitionURL.
attributeName
attributeType
baseFrequency
baseProfile
calcMode
clipPathUnits
contentScriptType
contentStyleType
diffuseConstant
edgeMode
externalResourcesRequired
filterRes
filterUnits
glyphRef
gradientTransform
gradientUnits
kernelMatrix
kernelUnitLength
keyPoints
keySplines
keyTimes
lengthAdjust
limitingConeAngle
markerHeight
markerUnits
markerWidth
maskContentUnits
maskUnits
numOctaves
pathLength
patternContentUnits
patternTransform
patternUnits
pointsAtX
pointsAtY
pointsAtZ
preserveAlpha
preserveAspectRatio
primitiveUnits
refX
refY
repeatCount
repeatDur
requiredExtensions
requiredFeatures
specularConstant
specularExponent
spreadMethod
startOffset
stdDeviation
stitchTiles
surfaceScale
systemLanguage
tableValues
targetX
targetY
textLength
viewBox
viewTarget
xChannelSelector
yChannelSelector
zoomAndPan
Polyglot markup uses lowercase letters for the values of the attributes in the following list when they exist on HTML elements. More specifically, where required, polyglot markup must use lower case letters for all ASCII letters in these attribute values; however, case requirements do not apply to non-ASCII letters such as Greek, Cyrillic, or non-ASCII Latin letters. Attributes for HTML elements other than those in the following list may have values made of mixed case letters. All attributes on non-HTML elements may have values made of mixed case letters.
accept
accept-charset
align
alink
axis
bgcolor
charset
checked
clear
codetype
color
compact
declare
defer
dir
direction
disabled
enctype
face
frame
hreflang
http-equiv
lang
language
link
media
method
multiple
nohref
noresize
noshade
nowrap
readonly
rel
rev
rules
scope
scrolling
selected
shape
target
text
type
valign
valuetype
vlink
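This example is non-normative. The requirement covers ASCII letters only, so a general-purpose lowercasing routine that also folds Greek or Cyrillic letters does more than is required. The following minimal sketch lowercases ASCII letters only and applies that to a small, illustrative subset of the attributes listed above; the function names are hypothetical.

```python
import string

# Illustrative subset of the attribute names listed above, not the full list.
LOWERCASE_VALUE_ATTRIBUTES = {"type", "rel", "method", "media", "dir", "shape", "scope"}

# Maps A-Z to a-z and leaves every other character, including non-ASCII letters, alone.
_ASCII_LOWER = str.maketrans(string.ascii_uppercase, string.ascii_lowercase)

def ascii_lowercase(value: str) -> str:
    """Lowercase the ASCII letters in value without touching other characters."""
    return value.translate(_ASCII_LOWER)

def normalize_attribute_values(attributes: dict) -> dict:
    """Lowercase the values of listed attributes; leave all other values as written."""
    return {
        name: ascii_lowercase(value) if name in LOWERCASE_VALUE_ATTRIBUTES else value
        for name, value in attributes.items()
    }

print(normalize_attribute_values({"rel": "StyleSheet", "type": "Text/CSS", "href": "Style.CSS"}))
```

The href value keeps its mixed case because href is not in the list.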
Polyglot markup uses only the elements in the following list as empty elements.
area
base
br
col
command
embed
hr
img
input
keygen
link
meta
param
source
Polyglot markup uses the minimized tag syntax for empty elements, e.g. <br/>. The alternative syntax <br></br> allowed by XML gives uncertain results in many existing user agents.
Given an empty instance of an element whose content model is not EMPTY (for example, an empty title or paragraph) polyglot markup does not use the minimized form (e.g. the document uses <p></p> and not <p />).
Note that MathML and SVG elements may be either self-closing or contain content.
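This example is non-normative. The following minimal sketch flags HTML elements that use the minimized form although they are not in the list of empty elements above. It assumes the input contains HTML elements only (MathML and SVG elements may legitimately be self-closing) and that attribute values contain no angle brackets; the regular expression is a simplification and the function name is hypothetical.

```python
import re

# The empty elements listed above.
EMPTY_ELEMENTS = {
    "area", "base", "br", "col", "command", "embed", "hr",
    "img", "input", "keygen", "link", "meta", "param", "source",
}

# Matches a tag that ends in "/>" and captures its element name.
_MINIMIZED_TAG = re.compile(r"<([A-Za-z][A-Za-z0-9]*)\b[^<>]*/>")

def wrongly_minimized(fragment: str) -> list:
    """Return names of minimized elements that are not in the empty-element list."""
    return [name for name in _MINIMIZED_TAG.findall(fragment) if name not in EMPTY_ELEMENTS]

fragment = '<p/><br/><img src="a.png" alt=""/><title />'
print(wrongly_minimized(fragment))
```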
Polyglot markup does not contain line breaks or multiple white space characters within attribute values. These are handled inconsistently by user agents.
Polyglot markup surrounds all attribute values with quotation marks. Attribute values may be surrounded either by single quotation marks or by double quotation marks.
See also Attribute Values.
Polyglot markup uses only the following named entity references:
amp
lt
gt
apos
quot
For entities beyond the previous list, a polyglot document uses character references. For example, polyglot markup uses &#160; instead of &nbsp;.
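This example is non-normative. Named entity references outside the five permitted ones can be rewritten as numeric character references. The following minimal sketch uses the HTML 4 entity table in Python's html.entities module, which is an assumption about which names need converting; unknown names are left unchanged and the function name is hypothetical.

```python
import re
from html.entities import name2codepoint

# The named entity references that polyglot markup may use as-is.
PERMITTED = {"amp", "lt", "gt", "apos", "quot"}

def to_character_references(markup: str) -> str:
    """Replace other known named entity references with decimal character references."""
    def replace(match):
        name = match.group(1)
        if name in PERMITTED or name not in name2codepoint:
            return match.group(0)  # keep permitted or unrecognized references untouched
        return f"&#{name2codepoint[name]};"
    return re.sub(r"&([A-Za-z][A-Za-z0-9]*);", replace, markup)

print(to_character_references("caf&eacute;&nbsp;&amp;&nbsp;bar &copy;"))
```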
Script and style commands should be included by linking to external files rather than including them in-line. However, polyglot markup must not link to an external stylesheet by using the xml-stylesheet processing instruction. See also Processing Instructions and the XML Declaration.
The following examples show the proper way to include external script and style, respectively:
<script src="external.js"></script>
<link rel="stylesheet" href="external.css"/>
Although document.write() and document.writeln() are valid in an HTML document, neither function may be used in XHTML. Therefore, neither is used in polyglot markup. Instead, use the innerHTML property for both HTML and XHTML.
Note that the innerHTML property takes a string. XML parsers parse the string as XML in XHTML. HTML parsers parse the string as HTML in HTML. Because of the difference in parsing, if you send the parser content that does not follow the rules for polyglot markup, the results will differ for a DOM created with an XML parser and one created with an HTML parser.
Polyglot markup uses external files if the document's script or style sheet contains <, &, ]]>, or --.
Note that XML parsers are permitted to silently remove the contents of comments; therefore, the historical practice of hiding scripts and style sheets within comments to make the documents backward compatible is unlikely to work as expected in XML-based user agents.
If polyglot markup must use script or style commands within its source code, either use safe content or wrap the command in a CDATA section. However, polyglot markup does not use a CDATA section unless it is being used within foreign content.
Safe content is content that does not contain a < or & character. The following example is safe because it does not contain problematic characters within the <script> tag.
<script>document.body.appendChild(document.createElement("div"));</script>
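This example is non-normative. Both tests described in this section are plain substring checks and can be automated. The following minimal sketch applies them to the text of a script or style sheet; the function names are hypothetical.

```python
def is_safe_inline_content(source: str) -> bool:
    """Safe content contains neither a '<' nor an '&' character."""
    return "<" not in source and "&" not in source

def should_be_external(source: str) -> bool:
    """True if the source uses '<', '&', ']]>' or '--' and so belongs in an external file."""
    return any(token in source for token in ("<", "&", "]]>", "--"))

safe = 'document.body.appendChild(document.createElement("div"));'
unsafe = "if (a < b && c) { run(); }"
print(is_safe_inline_content(safe), is_safe_inline_content(unsafe))
```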
Note that you cannot achieve the same DOM in both XHTML and HTML by using in-line commands in a CDATA section. However, this is not usually a problem unless the code has a dependency on the exact number of text nodes under a <script> or <style> element.
The following examples show in-line script and style commands wrapped in a CDATA section.
<script>
//<![CDATA[
(script goes here)
//]]>
</script>
<style>
/*<![CDATA[*/
(styles go here)
/*]]>*/
</style>
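This example is non-normative. The difference between the two DOMs can be observed by reading the same script element with an XML parser and with an HTML parser. The following minimal sketch uses Python's xml.etree.ElementTree and html.parser modules as stand-ins for the parsers of a user agent; html.parser is not an HTML5-conforming parser, and the class and function names are hypothetical.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

SOURCE = "<script>//<![CDATA[\nif (a < b) { run(); }\n//]]></script>"

class ScriptTextCollector(HTMLParser):
    """Collects the raw text found between <script> and </script>."""

    def __init__(self):
        super().__init__()
        self.inside_script = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.inside_script = True

    def handle_endtag(self, tag):
        if tag == "script":
            self.inside_script = False

    def handle_data(self, data):
        if self.inside_script:
            self.chunks.append(data)

def script_text_as_xml(source: str) -> str:
    # The XML parser consumes the CDATA markers and keeps only their content.
    return ET.fromstring(source).text

def script_text_as_html(source: str) -> str:
    # The HTML parser treats script content as raw text, CDATA markers included.
    collector = ScriptTextCollector()
    collector.feed(source)
    collector.close()
    return "".join(collector.chunks)

print(repr(script_text_as_xml(SOURCE)))
print(repr(script_text_as_html(SOURCE)))
```

The script behaves the same either way because the markers sit inside comments, but the text content of the element differs between the two parses.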
When using MathML or SVG, the parser follows the XML parsing rules. Polyglot markup does not rely on getting a CDATA instance from the DOM when using MathML or SVG, because the HTML parser does not create a CDATA instance in the DOM.
Many thanks to Daniel Glazman, Tony Ross, Sam Ruby, Jonas Sicking, Henri Sivonen, and Philip Taylor. Special thanks to the W3C TAG.
No informative references.