Polyglot Markup: A robust profile of the HTML5 vocabulary

Abstract

A document that uses polyglot markup is a document that is a stream of bytes that parses into identical document trees (with some exceptions, as noted in the Introduction) when processed either as HTML or when processed as XML. Polyglot markup that meets a well-defined set of constraints is interpreted as compatible, regardless of whether it is processed as HTML or as XHTML, per the HTML5 specification. Polyglot markup uses a specific DOCTYPE, namespace declarations, and a specific case—normally lower case but occasionally camel case—for element and attribute names. Polyglot markup uses lower case for certain attribute values. Further constraints include those on void elements, named entity references, and the use of scripts and style.

4. Writing HTML documents

4.1 Processing instructions and the XML declaration

Processing instructions and the XML declaration are both forbidden in polyglot markup.

4.2 Specifying a document’s character encoding

Polyglot markup uses the UTF-8 character encoding, the only character encoding for which both HTML and XML require support. HTML requires UTF-8 to be explicitly declared to avoid fallback to a legacy encoding. [HTML5]

For XML, UTF-8 is an encoding default. As such, character encoding MAY be left undeclared in XML with the result that UTF-8 is still supported [XML10].

Polyglot markup declares the UTF-8 character encoding in the following ways, which may be used separately or in combination (but note that there can only be a single HTML encoding declaration):

Within the document
- By using the Byte Order Mark (BOM) character
- By using the HTML encoding declaration
  - either in its charset attribute form: <meta charset="UTF-8"/>
  - or in its alternative form: <meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>
Outside the document
- By adding "charset=utf-8" to the MIME/HTTP Content-Type header [HTTP11], as the following examples show in HTML and XML, respectively:
Example 1
```
Content-type: text/html; charset=utf-8
```
Example 2
```
Content-type: application/xhtml+xml; charset=utf-8
```
Note that, when serving polyglot documents as XML, charset=UTF-8 can safely be omitted, due to the UTF-8 encoding default of XML:
Example 3
```
Content-type: application/xhtml+xml
```

Note

Both XML and HTML parsers are required to support the byte order mark. The HTML encoding declaration has no effect in XML. When the HTML encoding declaration is the only encoding declaration, the encoding default from XML makes XML parsers treat content as UTF-8.

The W3C Internationalization (i18n) Group recommends always including a visible encoding declaration in a document, because it helps developers, testers, and translation production managers check the encoding of a document visually.
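As a rough illustration of the in-document declarations listed in this section, the following Python sketch reports which of them a byte stream carries. The function name `utf8_declarations` and the regular expression are assumptions made for this example; the pattern is a simplification and not the encoding-sniffing algorithm of the HTML specification.

```python
import codecs
import re

# Simplified pattern covering both forms of the HTML encoding declaration.
# It is an assumption for illustration, not the HTML sniffing algorithm.
META_CHARSET = re.compile(
    rb'<meta\s+[^>]*charset\s*=\s*["\']?\s*utf-8', re.IGNORECASE)

def utf8_declarations(data: bytes) -> dict:
    """Report which in-document UTF-8 declarations are present."""
    return {
        # The UTF-8 byte order mark is the byte sequence EF BB BF.
        "bom": data.startswith(codecs.BOM_UTF8),
        # Only the start of the document is searched for the meta element.
        "meta": META_CHARSET.search(data[:1024]) is not None,
    }

doc = codecs.BOM_UTF8 + b'<!DOCTYPE html><html><head><meta charset="UTF-8"/></head></html>'
print(utf8_declarations(doc))
```

A check like this only inspects the document itself; a `charset` parameter in the Content-Type header has to be verified on the server side.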

4.3 The DOCTYPE

Polyglot markup uses a document type declaration (DOCTYPE) specified by section 8.1.1 of [HTML5]. In addition, the DOCTYPE conforms to the following rules:

The string DOCTYPE is in uppercase letters.
The string SYSTEM, if present, is in uppercase letters.
The string PUBLIC, if present, is in uppercase letters.
A Formal Public Identifier (FPI), if present, is a case-sensitive match of the registered FPI to which it points.
A URI, if present in the document type declaration, is a case-sensitive match of the URI to which it points.
- If the URI is the string about:legacy-compat, polyglot markup includes the string in lowercase letters, as required by HTML5.
- If the URI is an http URL, the URI points to the correct resource, using case-sensitive letters.

Note

The string html SHOULD be in lowercase letters, in order to be both well-formed and valid XML; however, the string MAY be in mixed case or uppercase letters and still be well-formed XML.

Note that using about:legacy-compat in XML may yield unpredictable parsing results, depending on the XML processing pipeline.

Polyglot markup does not use document type declarations for HTML4, HTML3, or HTML2, regardless of whether they contain a URI or not and regardless of their effect in HTML5 parsers, as these document type declarations are not compatible with XHTML.
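The case rules above lend themselves to a mechanical check. The following Python sketch is one possible way to do it; the function name `doctype_case_problems` and the regular expression are assumptions made for this example, and the pattern deliberately ignores the details of the optional FPI and URI.

```python
import re

# Simplified pattern written for this example; it is not a full
# DOCTYPE grammar and does not validate the optional FPI or URI.
DOCTYPE_RE = re.compile(
    r"<!(doctype)\s+([^\s>]+)(?:\s+(system|public)\b[^>]*)?\s*>",
    re.IGNORECASE,
)

def doctype_case_problems(source: str) -> list:
    """Return case-related problems found in the document's DOCTYPE."""
    match = DOCTYPE_RE.search(source)
    if match is None:
        return ["no DOCTYPE found"]
    keyword, name, kind = match.groups()
    problems = []
    if keyword != "DOCTYPE":
        problems.append("DOCTYPE keyword is not uppercase")
    if name != "html":
        problems.append("root name should be lowercase html")
    if kind is not None and not kind.isupper():
        problems.append(kind + " keyword is not uppercase")
    declaration = match.group(0)
    if ("about:legacy-compat" in declaration.lower()
            and "about:legacy-compat" not in declaration):
        problems.append("about:legacy-compat is not lowercase")
    return problems
```

An empty list means that none of the case rules checked here is violated; it does not establish that the DOCTYPE is otherwise conforming.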

4.4 Namespaces

The following rules apply to namespaces used in polyglot markup.

4.4.1 Element-level namespaces

[HTML5] introduces undeclared (native) default namespaces for the root HTML element, html, the root SVG element, svg, and the root MathML element, math. Polyglot markup declares the following default namespaces, when the markup languages are included in the document, to maintain XML compatibility [XML10]:

<html xmlns="http://www.w3.org/1999/xhtml">
<math xmlns="http://www.w3.org/1998/Math/MathML">
<svg xmlns="http://www.w3.org/2000/svg">

Polyglot markup declares the default namespaces on the root HTML element, html, the root SVG element, svg, and the root MathML element math, and on any HTML elements used as children of SVG or MathML elements. Polyglot markup does not declare any other default or prefixed element namespace, because [HTML5] does not natively support the declaring of any other default or prefixed element namespace.
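To see why the declarations matter on the XML side, the following Python sketch collects the namespace of every element as reported by an XML parser (here the standard library's ElementTree). The helper name `element_namespaces` and the sample markup are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

XHTML = "http://www.w3.org/1999/xhtml"
SVG = "http://www.w3.org/2000/svg"

declared = (
    '<html xmlns="http://www.w3.org/1999/xhtml"><head><title>t</title></head>'
    '<body><svg xmlns="http://www.w3.org/2000/svg"><circle r="1"/></svg></body></html>'
)
undeclared = '<html><head><title>t</title></head><body><svg/></body></html>'

def element_namespaces(xml_source: str) -> set:
    """Collect the namespace URI of every element as seen by an XML parser."""
    found = set()
    for element in ET.fromstring(xml_source).iter():
        # ElementTree writes qualified names as "{namespace}local".
        tag = element.tag
        found.add(tag[1:].split("}")[0] if tag.startswith("{") else "")
    return found

print(element_namespaces(declared))
print(element_namespaces(undeclared))
```

Without the xmlns declarations, the XML parser places every element in no namespace at all, whereas an HTML parser assigns the HTML and SVG namespaces natively.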

4.4.2 Attribute-level namespaces

[HTML5] introduces undeclared (native) support for attributes in the XLink namespace and with the prefix xlink:. To maintain XML-compatibility, polyglot markup explicitly declares the XLink namespace (xmlns:xlink="http://www.w3.org/1999/xlink"). [XML10]

To conform with the HTML specification, the declaration has to take place in each foreign content section where it is used, typically on such a section’s root element (e.g. on the svg start tag for an SVG section and on the math start tag for a MathML section), since the declaration must occur before any of the following xlink: prefixed attributes is used:

xlink:actuate
xlink:arcrole
xlink:href
xlink:role
xlink:show
xlink:title
xlink:type

Note that there are other prefixed attributes that can be used beyond xlink:href (such as xml:base). Polyglot markup does not declare these prefixes via xmlns. The prefixes are implicitly declared in XML and are automatically applied to the appropriate attributes in HTML.

The namespaced attributes, such as xml:lang="" and xmlns="", are "namespaced" within XHTML, SVG and MathML. Thus, the rules for how they can be used as CSS selectors are governed by CSS namespaces. [CSS3NAMESPACE] For more about the issues related to attribute selectors and namespaces, with and without prefixes, see the section on Scripting and styling polyglot markup.

4.5 Element syntax

Polyglot markup conforms to the following rules regarding elements.

4.5.1 Required elements and tags

Polyglot markup does not employ optional tags. HTML5’s concept of optional tags – missing start tags and/or end tags – covers elements that the HTML parser itself automatically adds to the DOM if the code doesn’t contain the tags for them. Because XML does not have such a feature that adds missing start and/or end tags to the DOM, omitting a tag in polyglot markup is equivalent to producing a document that is not well-formed or, if both tags are omitted, equivalent to not adding the element at all.

The absence of optional tags in polyglot markup may surprise an author who is not used to adding the tbody tags in their code, for example, or who is accustomed to omitting the end tag of the p element. However, the requirement to be well-formed with regard to tags is a key feature of polyglot markup that makes the code robust against subpar parsers and authoring surprises.

4.5.1.1 A minimal HTML document

Every polyglot markup document therefore contains an html, head, title, and body element, represented in the code with their tags. The html element is the root element. The head and body elements are children of the html element. The title element is a child of the head element. Therefore, the following source code would be the most basic polyglot markup document.

Example 4

<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml" lang="" xml:lang="">
  <head>
    <title></title>
  </head>
  <body>
  </body>
</html>

4.5.1.2 Required tags examples

Whenever it uses a tr element, polyglot markup always wraps the tr element inside a tbody, thead, or tfoot element. In HTML, if a group of one or more adjacent tr elements is not explicitly wrapped inside a tbody, thead, or tfoot element, the HTML parser creates and wraps a new tbody element around the tr elements. XML parsers do not create the tbody element, thus offering the potential for creating different DOMs.

Correct:

Example 5

<table>
<tbody>
<tr>...

Incorrect:

Example 6

<table>
<tr>...

Whenever it uses col elements within a table element, polyglot markup explicitly uses a colgroup element surrounding groups of the col elements. In HTML, if a group of one or more adjacent col elements are not explicitly wrapped inside a colgroup element, the HTML parser creates and wraps a new colgroup element around the col elements. XML parsers do not create the colgroup element, thus offering the potential for creating different DOMs.

Correct:

Example 7

<table>
<colgroup>
<col/>...

Incorrect:

Example 8

<table>
<col/>...
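Because an XML parser never adds the missing wrapper elements, the rules above can be checked on the XML side by looking at the parent of every tr and col element. The following Python sketch is one way to do that; the helper name `missing_wrappers` and the sample tables are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

X = "{http://www.w3.org/1999/xhtml}"
# Parents that polyglot markup requires for these elements.
REQUIRED_PARENTS = {
    "tr": {"tbody", "thead", "tfoot"},
    "col": {"colgroup"},
}

def missing_wrappers(xml_source: str) -> list:
    """List (child, parent) pairs where tr or col lacks its wrapper element."""
    problems = []
    for parent in ET.fromstring(xml_source).iter():
        parent_name = parent.tag.replace(X, "")
        for child in parent:
            child_name = child.tag.replace(X, "")
            allowed = REQUIRED_PARENTS.get(child_name)
            if allowed and parent_name not in allowed:
                problems.append((child_name, parent_name))
    return problems

good = ('<table xmlns="http://www.w3.org/1999/xhtml"><colgroup><col/></colgroup>'
        '<tbody><tr><td>x</td></tr></tbody></table>')
bad = ('<table xmlns="http://www.w3.org/1999/xhtml"><col/>'
       '<tr><td>x</td></tr></table>')
print(missing_wrappers(good))
print(missing_wrappers(bad))
```

Each reported pair marks a place where an HTML parser would insert a tbody or colgroup element that the XML DOM lacks.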

4.5.2 Excluded elements and tags

The noscript element is non-conforming in XHTML, and therefore also in polyglot markup, due to the fact that XML has no mechanism by which to produce the effect it has in HTML. [HTML5]

Note

Elements with features designed for HTML alone are non-polyglot from the outset. Currently, all such elements are legacy elements, and all but noscript, which HTML5 forbids in XHTML alone, are also obsoleted by the HTML specification for both HTML and XHTML.

4.5.3 Case-sensitivity

The following apply to any usage of element names, attribute names, or attribute values in markup, script, or CSS. Polyglot markup uses lower case letters for all ASCII letters. For non-ASCII letters—such as Greek, Cyrillic, or non-ASCII Latin letters—polyglot markup respects case sensitivity as it is called for.

4.5.3.1 Element names

Polyglot markup uses the correct case for element names.

Polyglot markup uses lowercase letters for all HTML element names.
Polyglot markup uses lowercase letters for all MathML element names.
Polyglot markup uses lowercase letters for all SVG element names except the following, for which polyglot markup uses mixed case:
- altGlyph
- altGlyphDef
- altGlyphItem
- animateColor
- animateMotion
- animateTransform
- clipPath
- feBlend
- feColorMatrix
- feComponentTransfer
- feComposite
- feConvolveMatrix
- feDiffuseLighting
- feDisplacementMap
- feDistantLight
- feFlood
- feFuncA
- feFuncB
- feFuncG
- feFuncR
- feGaussianBlur
- feImage
- feMerge
- feMergeNode
- feMorphology
- feOffset
- fePointLight
- feSpecularLighting
- feSpotLight
- feTile
- feTurbulence
- foreignObject
- glyphRef
- linearGradient
- radialGradient
- textPath

4.5.3.2 Attribute names

Polyglot markup uses the correct case for attribute names.

Polyglot markup uses lowercase letters in attribute names for all HTML elements.
Polyglot markup uses lowercase letters in attribute names for all MathML elements except the lowercase definitionurl, which polyglot markup changes to the mixed case definitionURL.
Polyglot markup uses lowercase letters in attribute names for all SVG elements except the following, for which polyglot markup uses mixed case:
- attributeName
- attributeType
- baseFrequency
- baseProfile
- calcMode
- clipPathUnits
- contentScriptType
- contentStyleType
- diffuseConstant
- edgeMode
- externalResourcesRequired
- filterRes
- filterUnits
- glyphRef
- gradientTransform
- gradientUnits
- kernelMatrix
- kernelUnitLength
- keyPoints
- keySplines
- keyTimes
- lengthAdjust
- limitingConeAngle
- markerHeight
- markerUnits
- markerWidth
- maskContentUnits
- maskUnits
- numOctaves
- pathLength
- patternContentUnits
- patternTransform
- patternUnits
- pointsAtX
- pointsAtY
- pointsAtZ
- preserveAlpha
- preserveAspectRatio
- primitiveUnits
- refX
- refY
- repeatCount
- repeatDur
- requiredExtensions
- requiredFeatures
- specularConstant
- specularExponent
- spreadMethod
- startOffset
- stdDeviation
- stitchTiles
- surfaceScale
- systemLanguage
- tableValues
- targetX
- targetY
- textLength
- viewBox
- viewTarget
- xChannelSelector
- yChannelSelector
- zoomAndPan

4.5.3.3 Attribute values

For characters in attribute values, polyglot markup maintains case consistency between markup, DOM APIs, and CSS when these attributes are used on HTML elements.

Polyglot markup maintains case consistency for values of the following attributes, whose values are MIME types, language tags, charsets, booleans, media queries, or keywords. Though not required, an easy way to maintain case consistency is to use only lower case values for these attributes. Polyglot markup maintains case consistency for these values because, for the purpose of selector matching, attribute values in XML are all treated case sensitively; however, HTML treats the values of these attributes as case insensitive (see 4.14.1 Case-sensitivity, in the HTML5 specification). [HTML5]

accept
accept-charset
charset
checked
defer
dir
direction
disabled
enctype
hreflang
http-equiv
media
method
multiple
readonly
rel (for values that do not contain a colon)
scope
selected
shape
target (keywords only; browsing context names are case-sensitive)
type (on a, link, object, script, or style elements)
type (on input)

Note that other specifications, such as RDFa, may place additional restrictions on the allowed values of certain attributes.

Also note that because XML processors don't recognize lang as containing language information, polyglot markup uses both the lang and the xml:lang attributes (see Language attributes); however, the CSS3 Selectors specification stipulates that language attributes, including xml:lang, are matched in a case insensitive way. [SELECT]

4.6 Element contents

For the different kinds of elements that HTML documents contain, polyglot markup conforms to the following contents rules.

4.6.1 Void elements

In the HTML syntax, void elements are elements that always are empty and never have an end tag. All elements listed as void in the HTML specification or in an extension spec, MUST in polyglot markup have the syntactic form of an XML empty-element tag (<foo/>). Other elements MUST NOT use the XML empty-element tag syntax.

Fig. 1 The void elements of the HTML specification at the time of writing.

area, base, br, col, embed, hr, img, input, keygen, link, meta, param, source, track, wbr

Example: Polyglot markup uses the minimized tag syntax for void elements, e.g. <br/>, and does not use <br></br>.

Example: Given an empty instance of an element whose content model is not EMPTY (for example, an empty title or paragraph) polyglot markup does not use the minimized form. E.g. the document uses <p></p> and not <p/>.

Note

Elements in foreign content, such as MathML and SVG elements, may be either self-closing or contain content.
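The void element rules are purely syntactic, so they can be checked with a tokenizer that distinguishes `<foo>` from `<foo/>`. The following Python sketch uses the standard library's `html.parser` for that purpose. The class and function names are assumptions made for this example, and the sketch deliberately ignores foreign content, where self-closing tags are allowed on any element.

```python
from html.parser import HTMLParser

VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "keygen", "link", "meta", "param", "source", "track", "wbr"}

class VoidSyntaxChecker(HTMLParser):
    """Collects tags whose syntactic form breaks the void element rules.

    Simplification: foreign content (SVG, MathML) is not handled.
    """
    def __init__(self):
        super().__init__()
        self.problems = []

    def handle_starttag(self, tag, attrs):
        # Called for <tag ...> written without a trailing slash.
        if tag in VOID:
            self.problems.append(f"<{tag}> should be written <{tag}/>")

    def handle_startendtag(self, tag, attrs):
        # Called for the XML empty-element form <tag ... />.
        if tag not in VOID:
            self.problems.append(f"<{tag}/> should be written <{tag}></{tag}>")

def void_problems(source: str) -> list:
    checker = VoidSyntaxChecker()
    checker.feed(source)
    checker.close()
    return checker.problems

print(void_problems("<p>a<br>b</p><p/>"))
```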

4.6.2 Raw text elements (`script` and `style`)

In polyglot markup, the contents of all elements listed as raw text elements in the HTML specification or in an extension spec, MUST conform to the extra requirements defined in this section.

Fig. 2 HTML5's list of raw text elements

script, style

In HTML syntax, the content of raw text elements is raw text. In other words, the HTML parser does not treat contained code that looks like tags (element tags and comment tags, character references, CDATA, etc.) as tags, character references, CDATA, etc., but as raw text. (See HTML5 for the exact rules.) In the XHTML syntax, however, the same constructs will be treated as tags, character references, CDATA etc.

As a result, it is simpler for authors to comply with the requirement of the default MIME types of the raw text elements in HTML than it is in XHTML. On the other hand, with CDATA, the raw text contents parsed as XHTML can be made even less semantic than the raw text data of HTML, leading to potential harms if the document is parsed as HTML.

Fig. 3 Overview over the differences in how HTML and XML parse raw text elements

| Ambiguous string | Info | HTML interpretation | XML interpretation inside a `<![CDATA[`section`]]>` | XML interpretation outside a `<![CDATA[`section`]]>` |
| --- | --- | --- | --- | --- |
| `<` | LESS-THAN SIGN | uninterpreted (but see the `</script` and `</style` rows) | uninterpreted | interpreted (commences tags, comments, CDATA) |
| `&` | AMPERSAND | uninterpreted | uninterpreted | interpreted (commences character reference or entity) |
| `<!--` | start of comment | partly uninterpreted | uninterpreted | interpreted |
| `-->` | end of comment | partly uninterpreted | uninterpreted | interpreted |
| `<![CDATA[` | start of CDATA declaration | uninterpreted | uninterpreted | interpreted (begins CDATA block) |
| `]]>` | end of CDATA declaration | uninterpreted | uninterpreted | interpreted (ends CDATA block) |
| `cdata content` | the content of CDATA sections | | uninterpreted | n/a |
| `</script` | if occurring inside `script` element and followed by one of "tab" (U+0009), "LF" (U+000A), "FF" (U+000C), "CR" (U+000D), U+0020 SPACE, ">" (U+003E), or "/" (U+002F) | terminates parent | uninterpreted | interpreted |
| `</style` | if occurring inside `style` element and followed by one of "tab" (U+0009), "LF" (U+000A), "FF" (U+000C), "CR" (U+000D), U+0020 SPACE, ">" (U+003E), or "/" (U+002F) | terminates parent | uninterpreted | interpreted |
| `<foo></bar>` | all other tags, well-formed or not | uninterpreted | uninterpreted | interpreted subject to normal parsing rules |
| `&#foo;` | character references | uninterpreted | uninterpreted | interpreted subject to normal parsing rules |
| `none of the above strings` | any other string | uninterpreted | uninterpreted | uninterpreted |

Syntactically, the polyglot subset is found by

either limiting the content to safe text content, that is, text that gets interpreted the same way in HTML and in XML.
or trying to even out the constraints differences by wrapping the contents in a CDATA section. The CDATA code is then seen as text by the HTML parser (and can thus interfere with the scripting or styling language!), while the XML parser sees the content as text without markup semantics.

Limiting the contents to safe text content requires more planning and control over the code, but can be said to be more robust than the CDATA option as it requires no extra, potentially breakable code to make the scripting or styling language work. The CDATA option on the other hand, gives more freedom and robustness against various errors that can happen because the author isn’t aware of the safe text content limitations or because the code is inserted by a tool that is unable to guarantee that the content is safe.

4.6.2.1 Options for delivering safe text content

Polyglot markup can deliver safe text content both externally and internally.

External safe text content. Polyglot markup can include scripts or stylesheets by linking to external files rather than including the code in-line. External files are parsed as the respective script or stylesheet and are thus not limited by the same restrictions as safe text content.
Fig. 4 Examples of linking to external scripts or stylesheets
Example 9
```

<script    src="external.js" ></script>
<link     href="external.css" rel="stylesheet"/>
<style>@import "external.css";</style>
```
Inline safe text content. Polyglot markup does not use characters or constructs that are interpreted differently in HTML and XML. This means not using the characters < and & as well as the CDATA end mark string – ]]>. Polyglot markup is agnostic as to whether one uses character entities or numeric character references, so long as they are valid. That is, for polyglot markup, there is no difference between a named reference such as &amp;amp; and its numeric equivalent &amp;#x26;.
Fig. 5 Examples of content that is not safe text content
Example 10
```

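<!-- Literal characters: not well-formed when parsed as XML -->
<style>q::before{content:"<";}</style>
<script>var a = "&";</script>

<!-- Character references: decoded by XML parsers but left as literal text by HTML parsers -->
<style>q::before{content:"&lt;";}</style>
<script>var a = "&amp;";</script>

<!-- Safe alternative: escapes of the styling or scripting language itself -->
<style>q::before{content:"\00003c";}</style>
<script>var a = "\u0026";</script>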
```
For CSS, the inline safe text content option would work very well most of the time, as < and & are not key parts of CSS and not very often used. But when it comes to JavaScript, the & and the < are key verbs (operators) of the language, and thus one soon runs into trouble – it is better to use external safe text content.
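The difference in interpretation can be observed with two parsers from the Python standard library, used here as stand-ins for an HTML and an XML parser. The class and function names in this sketch are assumptions made for the example; `html.parser` is a tokenizer rather than a full HTML5 parser, but it treats script content as raw text, which is the behavior of interest.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

class ScriptText(HTMLParser):
    """Collects the text an HTML tokenizer reports inside script elements."""
    def __init__(self):
        super().__init__()
        self.in_script = False
        self.text = ""

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.in_script = True

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_script = False

    def handle_data(self, data):
        if self.in_script:
            self.text += data

def script_text_as_html(source: str) -> str:
    parser = ScriptText()
    parser.feed(source)
    parser.close()
    return parser.text

def script_text_as_xml(source: str) -> str:
    # The sample sources below have script as their root element.
    return ET.fromstring(source).text or ""

escaped = '<script>var a = "&amp;";</script>'
portable = '<script>var a = "\\u0026";</script>'

print(script_text_as_html(escaped), script_text_as_xml(escaped))
print(script_text_as_html(portable), script_text_as_xml(portable))
```

The character reference reaches the script engine undecoded in HTML but decoded in XML, while the JavaScript escape yields the same script text in both.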

Fig. 6 Inline content containing no ambiguous strings

Example 11

<!-- The following example of inline script is polyglot markup because there are no ambiguous strings within the script element. -->
<script>document.body.appendChild(document.createElement("div"));</script>
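A tool that inserts inline scripts or stylesheets could test for the ambiguous strings before emitting the content. The following Python sketch shows the idea; the name `unsafe_strings` is an assumption made for this example, and the check covers only the three strings named in this section.

```python
# The strings that inline safe text content must avoid.
AMBIGUOUS = ("<", "&", "]]>")

def unsafe_strings(content: str) -> list:
    """Return the ambiguous strings that occur in inline script or style content."""
    return [s for s in AMBIGUOUS if s in content]

print(unsafe_strings('document.body.appendChild(document.createElement("div"));'))
print(unsafe_strings("if (a < b && c) run();"))
```

Content for which the list is non-empty would have to be moved to an external file, rewritten with language-level escapes, or wrapped in a CDATA section.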

Note

A workaround for using ambiguous strings is to include the properly escaped characters inside the src attribute of style or script tags.

4.6.2.2 Safe CDATA content

Polyglot markup accepts raw text content wrapped in a CDATA section; however, instead of permitting any content (except the CDATA end mark string – ]]>), only the subset that corresponds to the particular raw text element’s HTML constraints is permitted. See the “HTML interpretation” column in the parsing differences table above: all the strings marked “uninterpreted” there are also uninterpreted inside CDATA and thus constitute the safe subset of CDATA.

Wrapping raw text in a CDATA section introduces a new problem: when consumed as HTML, the start and end mark of the CDATA section is seen by the script or stylesheet interpreter and can thus cause syntax errors or even halt the script and stylesheet execution. A solution is to comment out the CDATA start and end marks by using the comment methods of the script or stylesheet language. Additionally, such as when script is used as a coding block container, it may be necessary to even comment out the scripting/styling comments by hiding them inside an XML comment.

4.6.2.2.1 Safe rules for CDATA use

These rules assume that CDATA is of limited use for CSS.

General rules:

The CDATA section is subject to HTML’s restrictions on <script> and <style>.
There can be only one CDATA section per raw text element.
Before the CDATA section there can only be one node - preferably only one line of code, which may consist of whitespace, or an XML comment, or a construct of the scripting/styling language (usually a comment of the scripting/styling language).
After the CDATA section there can only be one node - preferably only one line of code, which may consist of whitespace, or an XML comment, or a construct of the scripting/styling language (usually a comment of the scripting/styling language).

The ]]> string:

is always commented out if <![CDATA[ is commented out.
is never commented out if <![CDATA[ is not commented out.
Example 12
```
//]]>  </script>
```

The <![CDATA[ string can be handled in 3 ways:

<![CDATA[ - without commenting it out.
Example 13
```
<script type="not-CSS-and-not-JS"><![CDATA[foo]]></script>
```
Note
Using the <![CDATA[ block without commenting it out is not conforming as type="text/css" or type="text/javascript" content when parsed as HTML.
//<![CDATA[ - using scripting language comments for the entire block.
Example 14
```
<script>//<![CDATA[ FOO; //]]></script>
```
Note that the comment starts in the node before the CDATA section.
<!--//--><![CDATA[//><!-- - Same as the previous option, but the scripting comment is hidden inside an XML comment.
Example 15
```
<script><!--//--><![CDATA[//><!-- FOO; //--><!]]></script>
```
Note that the scripting language must accept <!-- as syntactically legal. JavaScript does, but other scripting languages may not.

This approach is compatible with CSS; however, rule 2 above prevents validity.
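The effect of commenting out the CDATA marks can be illustrated by comparing the script text that each syntax hands to the script engine. In the following Python sketch, the XML side uses the standard library's ElementTree, and the HTML side is modelled by taking the source between the tags verbatim, since script is a raw text element. The sample script and the helper `code_lines` are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

source = "<script>//<![CDATA[\nif (a < b && c) run();\n//]]></script>"

# XML: the CDATA marks are consumed by the parser; only their content remains.
xml_text = ET.fromstring(source).text

# HTML: the script engine receives everything between the tags verbatim,
# CDATA marks included.
html_text = source[len("<script>"):-len("</script>")]

def code_lines(script_text: str) -> list:
    """Drop lines that are JavaScript line comments (a simplification)."""
    return [line for line in script_text.splitlines()
            if not line.startswith("//")]

print(repr(xml_text))
print(repr(html_text))
print(code_lines(xml_text) == code_lines(html_text))
```

In both cases the lines carrying the CDATA marks are JavaScript comments, so the code that actually runs is the same.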

4.6.2.2.2 Comment syntax in `script`

Polyglot markup does not place the opening <script> tag inside comments within a script element. For compatibility-related reasons, when the HTML parser encounters an opening <script> tag inside a comment within a script element, it does not close the element on the next </script> end tag unless a closing comment string (-->) occurs first. If the parser does not see a comment end first, the element is closed on the second </script> end tag. If neither a comment end nor a second </script> end tag is found, the rest of the document is effectively commented out, because it is consumed as script content. Note that this behavior does not occur with the style element.

4.6.3 Escapable raw text elements

Escapable raw text elements are elements in which character references are permitted but where the HTML parser otherwise treats the content as text rather than as markup. For polyglot markup, escapable raw text elements are:

title
textarea

Polyglot markup uses the same rules of safe text content for escapable raw text elements, except that character entities are permitted for escapable raw text elements.

4.6.4 Foreign elements

The exact rules for foreign content elements are defined by the respective specifications.

4.6.5 Normal elements

Normal elements have no special restrictions other than those that generally apply to polyglot markup. Note that some elements (such as the iframe element) must be empty in polyglot markup, because the HTML specification sets this requirement on iframe in the XHTML syntax.

4.7 Text

4.7.1 Newlines in `textarea` and `pre` elements

When polyglot markup uses either a textarea or pre element, the text within the element should not begin with a newline.

4.8 Attributes

Polyglot markup surrounds all attribute values with quotation marks. Polyglot markup surrounds attribute values with either single quotation marks or with double quotation marks.

Polyglot markup does not use directly typed newline characters within an attribute.

Within an attribute's value, polyglot markup represents tabs, line feeds, and carriage returns as numeric character references rather than by using literal characters. For example, within an attribute's value, polyglot markup uses &amp;#x09; for a tab rather than the literal character '\t'. This is because of attribute-value normalization in XML [XML10].

The following example uses numeric character references (escaped characters) for the line feed, tab, and less-than characters within a srcdoc attribute.

Example 16

<iframe srcdoc="&lt;p>Hello &#x0A; &#x09; world!&lt;/p>" src="demo_iframe_srcdoc.htm"></iframe>

Note

Because of attribute-value normalization in XML [XML10], polyglot markup does not use newline characters within an attribute. Practically speaking, for source code with newlines within attributes, the DOMs generated via XML and HTML will be different. The differences are usually of small consequence, however, because whitespace differences have no behavioral impact on the page unless the attribute value is:

- explicitly examined by JavaScript, or
- used in an attribute whose content is rendered visually, such as the content of the alt attribute.

Note that directly typed newline characters are overtly not allowed in any attribute containing a URI.
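Attribute-value normalization can be observed directly with an XML parser. The following Python sketch uses the standard library's ElementTree; the helper name `xml_attr` and the sample markup are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

def xml_attr(markup: str, name: str) -> str:
    """Return an attribute value of the root element as an XML parser reports it."""
    return ET.fromstring(markup).get(name)

# Literal tab and newline characters are normalized to spaces by the XML parser.
literal = xml_attr('<img alt="one\ttwo\nthree"/>', "alt")

# Numeric character references survive normalization unchanged.
escaped = xml_attr('<img alt="one&#x09;two&#x0A;three"/>', "alt")

print(repr(literal))
print(repr(escaped))
```

An HTML parser keeps the literal tab and newline, so only the form with character references gives the same attribute value in both syntaxes.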

4.8.1 Disallowed attributes

The following attributes are not allowed in polyglot markup. These attributes have effects in documents parsed as XML but do not have effects in documents parsed as text/html. The HTML5 spec therefore defines them as invalid in text/html documents. [HTML5]

xml:space
xml:base

Note that the xml:space and xml:base attributes are allowed on SVG and MathML elements.

4.8.2 Language attributes

When specifying the language mapping of an element, polyglot markup uses both the lang and the xml:lang attributes. Neither attribute is to be used without the other, and polyglot markup maintains identical values for both lang and xml:lang.

The root element SHOULD always specify the language. Otherwise, HTML’s fallback language mechanism may step in and cause the language to vary depending on how the document is consumed: as XML (where the fallback language is not required to work) or via a file URI (where a fallback language from the external HTTP Content-Language header would not be available). Note that the internal http-equiv="Content-Language" meta element is non-conforming in HTML5. For more, see e.g. HTML5’s language determination rules.
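The pairing of lang and xml:lang can be verified on the XML side, where xml:lang is reported in the XML namespace. The following Python sketch lists elements whose two attributes disagree or where only one of them is present; the helper name `language_mismatches` and the sample documents are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# ElementTree reports xml:lang under the XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

def language_mismatches(xml_source: str) -> list:
    """Return the tags of elements whose lang and xml:lang values differ."""
    root = ET.fromstring(xml_source)
    # A missing attribute is reported as None, so an element carrying
    # only one of the two attributes is flagged as well.
    return [el.tag for el in root.iter() if el.get("lang") != el.get(XML_LANG)]

ok = ('<html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en">'
      '<body><p lang="fr" xml:lang="fr">x</p></body></html>')
bad = '<html xmlns="http://www.w3.org/1999/xhtml" lang="en"><body/></html>'

print(language_mismatches(ok))
print(language_mismatches(bad))
```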

4.8.3 Attributes with special considerations

The following attributes or their considerations require exceptions to the general rules for polyglot markup.

4.8.3.1 The `id` attribute

Polyglot markup does not contain any space characters within the value of an id attribute. This is because values for the id attribute may not contain space characters in HTML5. [HTML5]

4.9 Named entity references

Polyglot markup uses only the following named entity references:

amp
lt
gt
apos
quot

For entities beyond the previous list, polyglot markup uses character references. For example, polyglot markup uses &amp;#xA0; instead of &amp;nbsp;. Note that polyglot markup may use decimal values for escape characters (such as &amp;#160; for the character in the previous example); however, the Character Model for the World Wide Web recommends that content SHOULD use the hexadecimal form of character escapes rather than the decimal form when both are available. [CHARMOD]

Polyglot markup always uses character references for the less-than sign (<) and ampersand (&) when they are used as characters. However, for CDATA inside foreign content and for safe CDATA, the following rules apply:

for script and style elements that contain safe CDATA, they may be used as defined by the rules for safe CDATA;
for CDATA sections in a foreign content section (SVG, MathML), the XML rules for CDATA apply.
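Converting other named entity references to hexadecimal character references is a mechanical step. The following Python sketch shows one way to do it with the entity table from the standard library's `html.entities` module, which covers the HTML 4.01 entity names only. The function name `to_polyglot_references` is an assumption made for this example, and the sketch does not skip raw text elements or CDATA sections.

```python
import re
from html.entities import name2codepoint

# The five named entity references that polyglot markup may use.
ALLOWED = {"amp", "lt", "gt", "apos", "quot"}
ENTITY = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")

def to_polyglot_references(text: str) -> str:
    """Replace named entity references outside the allowed five with hex references."""
    def replace(match):
        name = match.group(1)
        if name in ALLOWED or name not in name2codepoint:
            # Leave allowed and unknown references untouched.
            return match.group(0)
        return "&#x{:X};".format(name2codepoint[name])
    return ENTITY.sub(replace, text)

print(to_polyglot_references("a&nbsp;b &amp; caf&eacute;"))
```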

4.10 Comments

Polyglot markup does not begin a comment with either ">" or "->".

4.11 Scripting and styling polyglot markup

When applying JavaScript and CSS to polyglot markup, the goal is to get the same result whether consumed as HTML or as XML. It is therefore important to be aware of scripting and styling features that give different results in HTML vs XML. These issues come in addition to the polyglot usage rules for raw text elements.

4.11.1 JavaScript: `innerHTML` vs `document.write()`

Although document.write() and document.writeln() work in HTML, neither function works in XHTML. The polyglot alternative is the innerHTML property, which works for both HTML and XHTML.

Note

The innerHTML property takes a string. However, XML parsers parse that string as XML in XHTML, while HTML parsers parse it as HTML in HTML. Because of this difference in parsing, the code that innerHTML inserts must follow the guidelines for polyglot markup so that the DOM generated by the XML parser does not differ from the DOM generated by the HTML parser.

4.11.2 CSS: Attribute selectors that require a namespace prefix

CSS allows authors to select elements by referencing their attributes using so-called attribute selectors: [attr]{rule:foo}. For the most part, attribute selectors can be used freely, since polyglot markup relies on default namespaces, which do not affect attributes. However, some of the attributes required by polyglot markup are namespaced – either by default (such as the xmlns attribute) or via a prefix that is namespaced by default (such as xml:, xmlns:, xlink:). Extension specs might allow other namespaced attributes beyond those defined by the HTML specification. As a result, a selector such as [xmlns]{rule:foo} will only work in HTML – it will not work in XHTML, where xmlns is a namespace attribute. The same goes for prefixed attributes: even if one escapes the colon ([xml\:lang]{rule:foo}), such selectors will only work in HTML. The exception is the namespace declaration for the xlink: prefix, which works as in XML even in the HTML syntax and must thus be selected in a namespaced way in both syntaxes.

To be able to select namespaced attributes in XML, the attribute selector must include a namespace prefix. [SELECT]

For the unprefixed, namespaced attribute xmlns, a polyglot selector that works in both HTML and XML can be created by using the asterisk (*) for the namespace prefix, indicating that the selector is to match all attribute names without regard to the attribute's namespace:

Example 17

[*|xmlns]{color:lime}

For prefixed attributes, the rules of polyglot markup, as well as the HTML specification itself, dictate that an xml:lang="foo" attribute must be accompanied by a corresponding lang="foo" attribute. In a conforming polyglot document, one can therefore use the same approach as for the xmlns attribute.

Example 18

[*|lang]{color:lime}

Note

However, the requirement of polyglot markup to use both xml:lang="foo" and lang="foo" means that even [lang]{color:lime} would work, in both XML parsers and HTML parsers.

When it comes to the xmlns:xlink attribute, which is required for polyglot svg elements, then, because it, in contrast to xml:lang, belongs to a foreign content element in HTML/XHTML, it is namespaced even in HTML. Hence, the only way – in HTML as well as in XML – to use this attribute as a selector, is by declaring the namespace of the xmlns: prefix in CSS:

Example 19


@namespace xmlns "http://www.w3.org/2000/xmlns/";
[xmlns|xlink]{border:dashed lime 3px}

In cases where the user agent does not support namespaces in CSS and/or in markup, it is necessary to use more than one selector. This could happen if the author declares namespaces – default or prefixed – which an extension specification permits, or if the user agent does not support attribute selectors with a CSS namespace prefix.

Example 20


            /*Selector for legacy user agents without support for namespace prefixed attribute selector:*/
            [xmlns],
            /*Selector for user agents with support for namespace prefixed attribute selector:*/
            [*|xmlns]
            {color:lime}
