W3C

Polyglot Markup: HTML-Compatible XHTML Documents

W3C Working Draft 13 January 2011

This version:
http://www.w3.org/TR/2011/WD-html-polyglot-20110113/
Latest published version:
http://www.w3.org/TR/html-polyglot/
Latest editor's draft:
http://dev.w3.org/html5/html-xhtml-author-guide/html-xhtml-authoring-guide.html
Previous version:
http://www.w3.org/TR/2010/WD-html-polyglot-20101019/
http://www.w3.org/TR/2010/WD-html-polyglot-20100624/
Editor:
Eliot Graff, Microsoft Corporation

Abstract

A document that uses polyglot markup is a document that is a stream of bytes that parses into identical document trees (with the exception of the xmlns attribute on the root element) when processed as HTML and when processed as XML. Polyglot markup that meets a well-defined set of constraints is interpreted as compatible, regardless of whether it is processed as HTML or as XHTML, per the HTML5 specification. Polyglot markup uses a specific DOCTYPE, namespace declarations, and a specific case (normally lower case but occasionally camel case) for element and attribute names. Polyglot markup uses lower case for certain attribute values. Further constraints include those on empty elements, named entity references, and the use of scripts and style.

Status of This Document

This section describes the status of this document at the time of its publication. Other documents may supersede this document. A list of current W3C publications and the latest revision of this technical report can be found in the W3C technical reports index at http://www.w3.org/TR/.

This document summarizes design guidelines for authors who wish their XHTML or HTML documents to validate on either HTML or XML parsers, assuming the parsers to be HTML5-compliant. This specification is intended to be used by web authors. It is not a specification for user agents and creates no obligations on user agents. Note that this recommendation does not define how HTML5-conforming user agents should process HTML documents. Nor does it define the meaning of the Internet Media Type text/html. For user agent guidance and for these definitions, see [HTML5] and [RFC2854].

This document was published by the HTML working group as a Working Draft. This document is intended to become a W3C Recommendation. Please submit comments regarding this document by using the W3C's public bug database (http://www.w3.org/Bugs/Public/) with the product set to HTML WG and the component set to HTML/XHTML Compatibility Authoring Guide (ed: Eliot Graff). If you cannot access the bug database, submit comments to public-html@w3.org (subscribe, archives) and arrangements will be made to transpose the comments to the bug database. All feedback is welcome.

Publication as a Working Draft does not imply endorsement by the W3C Membership. This is a draft document and may be updated, replaced or obsoleted by other documents at any time. It is inappropriate to cite this document as other than work in progress.

This document was produced by a group operating under the 5 February 2004 W3C Patent Policy. W3C maintains a public list of any patent disclosures made in connection with the deliverables of the group; that page also includes instructions for disclosing a patent. An individual who has actual knowledge of a patent which the individual believes contains Essential Claim(s) must disclose the information in accordance with section 6 of the W3C Patent Policy.

This is a work in progress! For the latest updates from the HTML WG, possibly including important bug fixes, please look at the editor's draft instead.

1. Introduction

This section is non-normative.

It is often valuable to be able to serve HTML5 documents that are also well-formed XML documents. An author may, for example, use XML tools to generate a document, and they and others may process the document using XML tools. These documents are served as text/html. The language used to create documents that can be parsed by both HTML and XML parsers is called polyglot markup. Polyglot markup is the overlap language of documents that are both HTML5 documents and XML documents.

2. Processing Instructions and the XML Declaration

Processing Instructions and the XML Declaration are both forbidden in polyglot markup.

3. Specifying a Document's Character Encoding

Polyglot markup uses either UTF-8 or UTF-16. UTF-8 is preferred. When polyglot markup uses UTF-8, it does not include a BOM. When polyglot markup uses UTF-16, it includes the BOM indicating little-endian UTF-16 or big-endian UTF-16.
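
The BOM rules above can be checked mechanically. The following Python sketch is illustrative only and is not part of this specification; the function name bom_is_polyglot is hypothetical, and the check looks only at the leading bytes of the stream, not at whether the rest of the stream is validly encoded.

```python
import codecs


def bom_is_polyglot(data: bytes, encoding: str) -> bool:
    """Return True if the byte stream follows the BOM rules stated above."""
    enc = encoding.lower()
    if enc == "utf-8":
        # UTF-8 polyglot markup does not include a byte order mark.
        return not data.startswith(codecs.BOM_UTF8)
    if enc == "utf-16":
        # UTF-16 polyglot markup starts with a little- or big-endian BOM.
        return data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE))
    # Any other encoding is outside the two that the text allows.
    return False
```

For example, bom_is_polyglot(b"<!DOCTYPE html>", "utf-8") returns True, while the same bytes prefixed with codecs.BOM_UTF8 are rejected.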

Polyglot markup declares character encoding one of two ways:

Using <meta charset="*"/> has no effect in XML. Therefore, polyglot markup may use <meta charset="*"/> provided the document is encoded as UTF-8 and the value of charset is a case-insensitive match for the string "utf-8". However, because the MIME type is not necessarily text/html, polyglot markup does not use <meta content="text/html; charset">.

Note that the W3C Internationalization (i18n) Group recommends always including a visible encoding declaration in a document, because it helps developers, testers, or translation production managers to check the encoding of a document visually.

4. The DOCTYPE

Polyglot markup uses a document type declaration (DOCTYPE) specified by section 8.1.1 of [HTML5]. In addition, the DOCTYPE conforms to the following rules:

Note that using about:legacy-compat in XML may yield unpredictable parsing results, depending on the XML processing pipeline.

Polyglot markup does not use document type declarations for HTML4, HTML3, or HTML2, regardless of whether they contain a URI or not and regardless of their effect in HTML5 parsers, as these document type declarations are not compatible with XHTML.

5. Namespaces

The following rules apply to namespaces used in polyglot markup.

5.1 Element-Level Namespaces

[HTML5] introduces undeclared (native) default namespaces for the root HTML element <html>, the root SVG element <svg>, and the root MathML element <math>. Polyglot markup declares the following default namespaces, when the markup languages are included in the document, to maintain XML-compatibility [XML10]:

Polyglot markup declares the default namespaces on the root HTML element <html>, the root SVG element <svg>, and the root MathML element <math>, and on any HTML elements used as children of SVG or MathML elements. Polyglot markup does not declare any other default or prefixed element namespace, because [HTML5] does not natively support the declaring of any other default or prefixed element namespace.

5.2 Attribute-Level Namespaces

[HTML5] introduces undeclared (native) support for attributes in the XLink namespace and with the prefix xlink:. Polyglot markup declares the XLink namespace on the HTML root element (<html>) or once on the foreign element where it is used (<svg> or <math>), to maintain XML-compatibility [XML10].

In polyglot markup, the xlink prefix uses the namespace declaration xmlns:xlink="http://www.w3.org/1999/xlink" before using the xlink prefix for the following attributes:

Furthermore, polyglot markup defines the xlink prefix only on foreign elements (any SVG or MathML element) but not the root <html> element or any other HTML element.

Note that there are other prefixed attributes that can be used beyond xlink:href (such as xml:base). Polyglot markup does not declare these prefixes via xmlns. The prefixes are implicitly declared in XML and are automatically applied to the appropriate attributes in HTML.

6. Elements

Polyglot markup conforms to the following rules regarding elements.

6.1 Required Elements

Every polyglot markup document contains an <html>, <head>, <title>, and <body> element. The <html> element is the root element. The <head> and <body> elements are children of the <html> element. The <title> element is a child of the <head> element. Therefore, the following source code would be the most basic polyglot markup document.

<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en">
  <head>
    <title></title>
  </head>
  <body>
  </body>
</html>

Polyglot markup explicitly uses a tbody element surrounding groups of tr elements within a table element. HTML parsers insert the tbody element, but XML parsers do not, thus creating different DOMs.

Correct:

<table>
<tbody>
<tr>...

Incorrect:

<table>
<tr>...

Polyglot markup explicitly uses a colgroup element surrounding groups of col elements within a table element. HTML parsers insert the colgroup element, but XML parsers do not, thus creating different DOMs.

Correct:

<table>
<colgroup>
<col/>...

Incorrect:

<table>
<col/>...
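
Because an XML parser does not insert the missing tbody or colgroup elements, a document can be checked for this condition with any XML parser. The following Python sketch is illustrative only; the function name tables_missing_wrappers is hypothetical, and the sketch assumes the input is well-formed XML that declares the XHTML namespace.

```python
import xml.etree.ElementTree as ET

XHTML = "{http://www.w3.org/1999/xhtml}"


def tables_missing_wrappers(source: str) -> list:
    """List 'tr' and 'col' elements that sit directly under a 'table'."""
    root = ET.fromstring(source)
    problems = []
    for table in root.iter(XHTML + "table"):
        for child in table:
            if child.tag in (XHTML + "tr", XHTML + "col"):
                # Strip the namespace part for a readable report.
                problems.append(child.tag[len(XHTML):])
    return problems
```

An empty list means that every tr and col element found has an explicit wrapper; any entry in the list names an element that an HTML parser would wrap but an XML parser would not.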

6.2 Elements that Cannot Be Used in Polyglot Markup

Polyglot markup does not use the <noscript> element, because the <noscript> element cannot be used in XML documents. [HTML5]

6.3 Case-Sensitivity

The following guidelines apply to any usage of element names, attribute names, or attribute values in markup, script, or CSS. Polyglot markup uses lower case letters for all ASCII letters. For non-ASCII letters—such as Greek, Cyrillic, or non-ASCII Latin letters—polyglot markup respects case sensitivity as it is called for.

6.3.1 Element Names

Polyglot markup uses the correct case for element names.

  • Polyglot markup uses lowercase letters for all HTML element names.
  • Polyglot markup uses lowercase letters for all MathML element names.
  • Polyglot markup uses lowercase letters for all SVG element names except the following, for which polyglot markup uses mixed case:
    • altGlyph
    • altGlyphDef
    • altGlyphItem
    • animateColor
    • animateMotion
    • animateTransform
    • clipPath
    • feBlend
    • feColorMatrix
    • feComponentTransfer
    • feComposite
    • feConvolveMatrix
    • feDiffuseLighting
    • feDisplacementMap
    • feDistantLight
    • feFlood
    • feFuncA
    • feFuncB
    • feFuncG
    • feFuncR
    • feGaussianBlur
    • feImage
    • feMerge
    • feMergeNode
    • feMorphology
    • feOffset
    • fePointLight
    • feSpecularLighting
    • feSpotLight
    • feTile
    • feTurbulence
    • foreignObject
    • glyphRef
    • linearGradient
    • radialGradient
    • textPath

6.3.2 Attribute Names

Polyglot markup uses the correct case for attribute names.

  • Polyglot markup uses lowercase letters in attribute names for all HTML elements.
  • Polyglot markup uses lowercase letters in attribute names for all MathML elements except the lowercase definitionurl, which polyglot markup changes to the mixed case definitionURL.
  • Polyglot markup uses lowercase letters in attribute names for all SVG elements except the following, for which polyglot markup uses mixed case:
    • attributeName
    • attributeType
    • baseFrequency
    • baseProfile
    • calcMode
    • clipPathUnits
    • contentScriptType
    • contentStyleType
    • diffuseConstant
    • edgeMode
    • externalResourcesRequired
    • filterRes
    • filterUnits
    • glyphRef
    • gradientTransform
    • gradientUnits
    • kernelMatrix
    • kernelUnitLength
    • keyPoints
    • keySplines
    • keyTimes
    • lengthAdjust
    • limitingConeAngle
    • markerHeight
    • markerUnits
    • markerWidth
    • maskContentUnits
    • maskUnits
    • numOctaves
    • pathLength
    • patternContentUnits
    • patternTransform
    • patternUnits
    • pointsAtX
    • pointsAtY
    • pointsAtZ
    • preserveAlpha
    • preserveAspectRatio
    • primitiveUnits
    • refX
    • refY
    • repeatCount
    • repeatDur
    • requiredExtensions
    • requiredFeatures
    • specularConstant
    • specularExponent
    • spreadMethod
    • startOffset
    • stdDeviation
    • stitchTiles
    • surfaceScale
    • systemLanguage
    • tableValues
    • targetX
    • targetY
    • textLength
    • viewBox
    • viewTarget
    • xChannelSelector
    • yChannelSelector
    • zoomAndPan

6.3.3 Attribute Values

Polyglot markup requires the case used for characters in the values of the following attributes to be consistent between markup, DOM APIs, and CSS when these attributes are used on HTML elements. This is because XML is case sensitive, but the values of these attributes are treated as case insensitive in HTML when matched via CSS selectors (see 4.14.1 Case-sensitivity, in the HTML5 specification) [HTML5]. In addition, polyglot markup respects the case sensitivity of all other attribute values and of non-ASCII characters in the values of the attributes listed. Note that other specifications, such as RDFa, may place additional restrictions on the allowed values of certain attributes.

  • accept
  • accept-charset
  • align
  • alink
  • axis
  • bgcolor
  • charset
  • checked
  • clear
  • codetype
  • color
  • compact
  • declare
  • defer
  • dir
  • direction
  • disabled
  • enctype
  • face
  • frame
  • hreflang
  • http-equiv
  • lang
  • language
  • link
  • media
  • method
  • multiple
  • nohref
  • noresize
  • noshade
  • nowrap
  • readonly
  • rel
  • rev
  • rules
  • scope
  • scrolling
  • selected
  • shape
  • target
  • text
  • type
  • valign
  • valuetype
  • vlink

6.4 Void Elements

Polyglot markup uses only the elements in the following list as void elements.

Polyglot markup uses the minimized tag syntax for void elements, e.g. <br/>, rather than the alternative syntax <br></br>.

Given an empty instance of an element whose content model is not EMPTY (for example, an empty title or paragraph) polyglot markup does not use the minimized form (e.g. the document uses <p></p> and not <p />).

Note that MathML and SVG elements may be either self-closing or contain content.

6.5 Elements with Special Considerations

The following elements or their considerations require exceptions to the general rules for polyglot markup.

6.5.1 Newlines in <textarea> and <pre> Elements

When polyglot markup uses either a <textarea> or <pre> element, the text within the element does not begin with a newline.

6.5.2 Elements that Cannot Contain Special Characters

Because the parsing rules of HTML and XML conflict, polyglot markup uses the following elements only if they do not contain angle brackets ("<" or ">") or ampersands ("&").

  • plaintext
  • xmp

7. Attributes

Within an attribute's value, polyglot markup represents tabs, line feeds, and carriage returns as numeric character references rather than by using literal characters. For example, within an attribute's value, polyglot markup uses &#x9; for a tab rather than the literal character '\t'. This is because of attribute-value normalization in XML [XML10].
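
The effect of attribute-value normalization can be observed with any XML parser. The following Python sketch is illustrative only; the helper name escape_attr_whitespace is hypothetical, and it assumes the value contains no other characters that need escaping (such as "&", "<", or quotation marks).

```python
import xml.etree.ElementTree as ET


def escape_attr_whitespace(value: str) -> str:
    """Replace tab, line feed, and carriage return with character references."""
    return (
        value.replace("\t", "&#x9;")
        .replace("\n", "&#xA;")
        .replace("\r", "&#xD;")
    )


# A literal tab in an attribute value is normalized to a space by the XML parser.
literal = ET.fromstring('<a title="x\ty"/>').get("title")

# The numeric character reference survives parsing as a real tab character.
escaped = ET.fromstring(
    '<a title="%s"/>' % escape_attr_whitespace("x\ty")
).get("title")
```

Here literal holds "x y" (with a space) while escaped holds the original tab, which is why the literal character cannot be relied on to round-trip through an XML parser.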

Polyglot markup surrounds all attribute values with quotation marks. Either single quotation marks or double quotation marks may be used.

See also Attribute Values.

7.1 Disallowed Attributes

The following attributes are not allowed in polyglot markup. These attributes have effects in documents parsed as XML but do not have effects in documents parsed as text/html. The HTML5 spec therefore defines them as invalid in text/html documents. [HTML5]

Note that the xml:space and xml:base attributes are allowed on SVG and MathML elements.

7.2 Language Attributes

When using language attributes, polyglot markup uses both the lang and xml:lang attributes. Neither attribute is to be used without the other, and polyglot markup maintains identical values for both lang and xml:lang.

Polyglot markup uses the language attributes in the html element to set the default language for the document.
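
The pairing rule for lang and xml:lang lends itself to an automated check. The following Python sketch is illustrative only; the function name lang_mismatches is hypothetical, and the sketch assumes well-formed XML input. It reports every element on which the two attributes are not both absent or both present with identical values.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the reserved XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def lang_mismatches(source: str) -> list:
    """Return the tags of elements whose lang and xml:lang values differ."""
    root = ET.fromstring(source)
    bad = []
    for element in root.iter():
        # A missing attribute yields None, so a lone attribute is a mismatch.
        if element.get("lang") != element.get(XML_LANG):
            bad.append(element.tag)
    return bad
```

An empty result means that every element either carries both attributes with the same value or carries neither.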

8. Named Entity References

Polyglot markup uses only the following named entity references:

For entities beyond the previous list, polyglot markup uses character references. For example, polyglot markup uses &#xA0; instead of &nbsp;. Note that polyglot markup may use decimal values for escape characters (such as &#160; in the previous example); however, the Character Model for the World Wide Web recommends that content should use the hexadecimal form of character escapes rather than the decimal form when both are available. [CHARMOD]
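
The reason for preferring character references can be demonstrated with an XML parser that has no DTD to consult. The following Python sketch is illustrative only; the helper names hex_reference and parses_as_xml are hypothetical, and the entity table comes from the Python standard library rather than from this specification.

```python
import xml.etree.ElementTree as ET
from html.entities import name2codepoint


def hex_reference(entity_name: str) -> str:
    """Return the hexadecimal character reference for an HTML entity name."""
    return "&#x%X;" % name2codepoint[entity_name]


def parses_as_xml(fragment: str) -> bool:
    """Report whether the fragment is accepted by an XML parser."""
    try:
        ET.fromstring(fragment)
        return True
    except ET.ParseError:
        return False
```

With these helpers, hex_reference("nbsp") yields "&#xA0;", parses_as_xml("<p>a&nbsp;b</p>") is False because the entity is undefined in XML, and parses_as_xml("<p>a&#xA0;b</p>") is True.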

9. Script and Style

Polyglot markup includes script and style commands by linking to external files rather than including them in-line. Polyglot markup does not link to an external stylesheet by using the xml-stylesheet processing instruction. See also Processing Instructions and the XML Declaration.

The following examples show how polyglot markup includes external script and style, respectively:

<script src="external.js"></script>
<link rel="stylesheet" href="external.css"/>

Although document.write() and document.writeln() are valid in an HTML document, neither function may be used in XHTML. Therefore, neither is used in polyglot markup. Instead, use the innerHTML property for both HTML and XHTML. Note that the innerHTML property takes a string. In XHTML, XML parsers parse the string as XML; in HTML, HTML parsers parse the string as HTML. Because of this difference in parsing, if you send the parser content that does not follow the rules for polyglot markup, the results will differ between a DOM created with an XML parser and one created with an HTML parser.

9.1 External Script and Style

Polyglot markup uses external script or style sheet files if the document's script or style sheet contains <, &, ]]>, or --. Note that XML parsers are permitted to silently remove the contents of comments; therefore, the historical practice of hiding scripts and style sheets within comments to make the documents backward compatible is unlikely to work as expected in XML-based user agents.

9.2 In-line Script and Style

If polyglot markup must use script or style commands within its source code, either use safe content or wrap the command in a CDATA section. However, polyglot markup does not use a CDATA section unless it is being used within foreign content.

9.2.1 Safe Content

Safe content is content that does not contain a < or & character. The following example is safe because it does not contain problematic characters within the <script> tag.

<script>document.body.appendChild(document.createElement("div"));</script>
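
The two character tests described in sections 9.1 and 9.2.1 can be expressed as simple string checks. The following Python sketch is illustrative only; the function names needs_external_file and is_safe_content are hypothetical, and the checks are purely textual, so they do not take into account where in the script or style sheet a character sequence occurs.

```python
def needs_external_file(source: str) -> bool:
    """True if the script or style text uses <, &, ]]> or -- (section 9.1)."""
    return any(token in source for token in ("<", "&", "]]>", "--"))


def is_safe_content(source: str) -> bool:
    """True if the text contains neither < nor & (section 9.2.1)."""
    return "<" not in source and "&" not in source
```

For the example script above, is_safe_content returns True and needs_external_file returns False. A script that contains only a decrement such as i-- is still safe content by the definition in 9.2.1, but it triggers the section 9.1 rule.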

9.2.2 Wrapping a Command in a CDATA Section

Note that you cannot achieve the same DOM in both XHTML and HTML by using in-line commands in a CDATA section. However, this is not usually a problem unless the code has a dependency on the exact number of text nodes under a <script> or <style> element. The following examples show in-line script and style commands wrapped in a CDATA section.

<script>
  //<![CDATA[
    (script goes here)
  //]]>
</script>
<style>
  /*<![CDATA[*/
    (styles go here)
  /*]]>*/
</style>

When using MathML or SVG, the parser follows the XML parsing rules. Polyglot markup does not rely on getting a CDATA instance from the DOM when using MathML or SVG, because the HTML parser does not create a CDATA instance in the DOM.

10. Comments in Polyglot Markup

Polyglot markup does not begin a comment with either ">" or "->".

11. Exceptions from the Foreign Content Parsing Rules

12. Example Document

The following example code acts as polyglot markup and validates as either XHTML or as HTML. You can view the page live at http://dev.w3.org/html5/html-xhtml-author-guide/SamplePage.html.

<!DOCTYPE html>


<html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en">

  <head>
    <title>A Sample Page Using Polyglot Markup</title>
	<!-- The link element is self-closing as described in Section 6.4 Void Elements -->
	<!-- Style commands are included by linking to an external file rather than including them in-line, 
	  as described in Section 9. Script and Style -->
	<link type="text/css" rel="stylesheet" href="Sample.css"/>
  </head>

  <body>
    <h1>Sample Page Using Polyglot Markup</h1>
    <p>
      The source code for this document uses polyglot markup, 
      a document that is a stream of bytes that parses into identical document trees 
      (with the exception of the xmlns attribute on the root element) when processed as HTML and when processed as XML.
      The source code for this document also contains additional comments about the use of 
      polyglot markup.
    </p>
		
    <h2>Foreign Elements</h2>
    <p>
      The following shapes use SVG elements.
      Polyglot markup introduces undeclared (native) default namespaces 
      for the root SVG element (&lt;svg&gt;) and respects the mixed-case element names and values
      when appropriate, as described in sections 5.1 Element-Level Namespaces, 
      6.3.1 Element Names, and 6.3.3 Attribute Values.
    </p>

    <!-- Polyglot markup declares the xlink: namespace on the <svg> element to maintain XML-compatibility  -->
    <svg width="350" height="250" version="1.1" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink">
      <g>
        <title>Three SVG shapes</title>
        <desc>
          This SVG image contains an ellipse filled with a gradient that goes from white to blue as it moves outward from the center. 
          A yellow rectangle with a black border overlaps the ellipse in the upper-left quadrant, 
          and a red spiral on a white background overlaps the ellipse in the bottom-right quadrant. 
          The red spiral is also a link to the example code for that SVG shape.
        </desc>
        <defs>
          <!-- Note that "radialGradient" and "myGradient" respect mixed-case values. -->
          <radialGradient id="myGradient" cx="50%" cy="50%" r="50%" fx="50%" fy="50%">
            <stop offset="0%" style="stop-color:rgb(200,200,200); stop-opacity:0"/>
            <stop offset="100%" style="stop-color:rgb(0,0,255); stop-opacity:1"/>
          </radialGradient>
        </defs>
      <ellipse cx="50%" cy="50%" rx="50%" ry="42%" style="fill:url(#myGradient)"/>
      <rect x="0" y="0" width="100" height="100" style="fill: yellow; stroke: black;"/> 
      <a xlink:href="http://www.w3schools.com/svg/tryit.asp?filename=path2&amp;type=svg">
        <path  transform="translate(60, -175)" d="M153 334
          C153 334 151 334 151 334
          C151 339 153 344 156 344
          C164 344 171 339 171 334
          C171 322 164 314 156 314
          C142 314 131 322 131 334
          C131 350 142 364 156 364
          C175 364 191 350 191 334
          C191 311 175 294 156 294
          C131 294 111 311 111 334
          C111 361 131 384 156 384
          C186 384 211 361 211 334
          C211 300 186 274 156 274"
          style="fill:white;stroke:red;stroke-width:2"/>
        </a>
      </g>
    </svg> 		
    <h2>Void Elements</h2>
    <!-- Given an empty instance of an element whose content model is not EMPTY (in this case, an empty paragraph) 
    polyglot markup does not use the minimized form, as described in Section 6.4 Void Elements -->
    <p></p>
    <p>
      There is an empty &lt;p&gt; element before this paragraph.
      Polyglot markup uses &lt;p&gt;&lt;/p&gt; and not &lt;p /&gt;.
    </p>
    <p>
      Polyglot markup treats certain elements as self-closing, 
      void elements, such as the following &lt;img&gt; element.
    </p>
    <img height="48" width="72" alt="W3C" src="http://www.w3.org/Icons/w3c_home"/>
    <p>
      For more information, see Section 6.4 Void Elements.
    </p>


    <h2>Required Elements</h2>
    <p>
      The following table uses the required &lt;tbody&gt; element, as described in
      Section 6.1 Required Elements.
    </p>
    <table>
      <tbody>
        <tr>
          <th>Column One</th>
          <th>Column Two</th>
        </tr>
        <tr>
          <td>Row 1, Column 1</td>
          <td>Row 1, Column 2</td>
        </tr>
        <tr>
          <td>Row 2, Column 1</td>
          <td>Row 2, Column 2</td>
        </tr>
        <tr>
          <td>Row 3, Column 1</td>
          <td>Row 3, Column 2</td>
        </tr>
      </tbody>
    </table>

    <p>
      The following table uses the required &lt;colgroup&gt; element, as described in
      Section 6.1 Required Elements.  
    </p>
    <table>
      <colgroup>
        <col style="background-color:silver"/>
        <col style="background-color:gray"/>
        <col style="background-color:yellow"/>
      </colgroup>
      <tbody>
        <tr>
          <th>ISBN</th>
          <th>Title</th>
          <th>Price</th>
        </tr>
        <tr>
          <td>3476896</td>
          <td>My first HTML</td>
          <td>$53</td>
        </tr>
        <tr>
          <td>1234567</td>
          <td>Intermediate Polyglot</td>
          <td>$49</td>
        </tr>
      </tbody>
    </table>

    <h2>Named Entity References</h2>
    <p>
      This paragraph uses the string "&amp;amp;" for ampersands ("&amp;") and uses the string "&amp;#xA0;"
      for a nonbreaking space between the words "polyglot&#xA0;markup," as described in
      Section 8. Named Entity References.
    </p>
  </body>
</html> 

A. Acknowledgements

Many thanks to Daniel Glazman, Richard Ishida, Tony Ross, Sam Ruby, Jonas Sicking, Leif Halvard Silli, Henri Sivonen, Manu Sporny, and Philip Taylor. Special thanks to the W3C TAG and the W3C Internationalization (i18n) Core Working Group.

B. References

B.1 Normative references

[CHARMOD]
Martin J. Dürst; et al. Character Model for the World Wide Web 1.0: Fundamentals. 15 February 2005. W3C Recommendation. URL: http://www.w3.org/TR/2005/REC-charmod-20050215
[HTML5]
Ian Hickson; David Hyatt. HTML 5. 4 March 2010. W3C Working Draft. (Work in progress.) URL: http://www.w3.org/TR/2010/WD-html5-20100304/
[HTTP11]
R. Fielding; et al. Hypertext Transfer Protocol - HTTP/1.1. June 1999. Internet RFC 2616. URL: http://www.ietf.org/rfc/rfc2616.txt
[RFC2854]
D. Connolly; L. Masinter. The 'text/html' Media Type. June 2000. Internet RFC 2854. URL: http://www.rfc-editor.org/rfc/rfc2854.txt
[XML10]
C. M. Sperberg-McQueen; et al. Extensible Markup Language (XML) 1.0 (Fifth Edition). 26 November 2008. W3C Recommendation. URL: http://www.w3.org/TR/2008/REC-xml-20081126/

B.2 Informative references

No informative references.