Serialization dependent processing differences
HTML5 supports two separate serializations. This page aims to list exhaustively the differences in the DOM, CSS and other processing that arise because of the different serializations. The goal is to understand those differences thoroughly, either to minimize them for the final recommendation or to provide authors with documentation on how to cope with them. Throughout this article the term 'text serialization' refers to any serialization that conforms to the legacy HTML serialization and may be delivered with the media type 'text/html', while 'XML serialization' refers to any serialization of HTML that conforms to both XML and HTML5 and may be delivered with the media type 'application/xhtml+xml'.
- The 'tbody' and 'colgroup' elements: The current draft (like XHTML 1.0 and 1.1) does not require tbody or colgroup elements in the XML serialization. In the text serialization, by contrast, the tbody and colgroup elements are required, though their opening and closing tags may be omitted. Requiring tbody and colgroup elements in both serializations (though implied in the text/html serialization) would avoid this confusion.
- The 'body' (and 'head') element: It has been proposed to allow the omission of the 'head' and 'body' elements within compound documents. In such compound documents the 'html' element would simply adopt the content model of the 'body' element. Such a content model change could affect DOM scripts and CSS selectors that assume a body element exists.
- Structured inline content in the 'p' element: The XML serialization permits the use of structured inline content such as BLOCKQUOTE, lists, and tables within a P element.
- Normally CSS calls for the canvas surrounding a document to be painted according to the document element. In the XML serialization that has often been construed to mean that the HTML element's background should be used to paint the canvas. HTML4 called for the body element's background to be used to paint the surrounding canvas. If this difference persists in HTML5, it would constitute another difference authors must deal with. Making the body background and the HTML background the same would work around the issue.
- No implicit 'tbody' or 'colgroup' elements in XML serialization
- No implicit 'body' elements
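The effect of a missing implied element on scripts and selectors can be reproduced with any XML parser, since an XML parser never infers elements. The following is a minimal sketch in Python, using the standard library's xml.etree.ElementTree as a stand-in for an XML-serialization parser; the helper name rows_via_tbody and the sample tables are made up for illustration.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"


def rows_via_tbody(table_markup):
    """Count the rows reachable through the path table > tbody > tr."""
    table = ET.fromstring(table_markup)
    return len(table.findall("h:tbody/h:tr", {"h": XHTML_NS}))


# Hypothetical sample tables: identical except for the explicit tbody.
WITHOUT_TBODY = (
    '<table xmlns="%s"><tr><td>1</td></tr><tr><td>2</td></tr></table>' % XHTML_NS
)
WITH_TBODY = (
    '<table xmlns="%s"><tbody><tr><td>1</td></tr><tr><td>2</td></tr></tbody></table>'
    % XHTML_NS
)

print(rows_via_tbody(WITHOUT_TBODY))  # 0: the XML parser adds no tbody
print(rows_via_tbody(WITH_TBODY))     # 2
```

A text/html parser would build the tbody for the first table as well, so a path or selector that goes through tbody matches there but not in the XML serialization.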
The following APIs are currently defined in the draft to be serialization dependent (or HTML document flag dependent). Perhaps a better approach would be to make these methods serialization agnostic or, if this would break existing content, to introduce new serialization-aware methods and deprecate these methods.
- document.write(): For example a document.write method could include an argument declaring the serialization of the write string.
- createElement(): Most implementations are converging on creating elements with a namespace. Interoperability issues persist where the root element or the default namespace for the document or document fragment differs from, or is not in, the HTML namespace.
- innerHTML and outerHTML: These attributes have HTML in their name. It may make sense to treat them as text/html and add innerXML and outerXML attributes for accessing serialization-aware strings.
While technically not a serialization difference per se, the document.createElement() method does have other issues that have traditionally arisen from the common conception that XML serializations have namespaces and that the text/html serialization does not. If HTML5 adds namespaces to the text/html serialization, then the same issues will apply to both serializations. Also, since the major browsers have implemented document.createElement() to either 1) always create an HTML element, 2) create an element in the document element's namespace, or 3) create an element in the document element's default namespace, this is not an issue for HTML host documents regardless of serialization.
The document.createElement method was never clearly defined for a namespace aware serialization such as XML 1.0 with namespaces. The suggestion to add namespaces to the text/html serialization brings similar confusion to the text/html serialization. Several approaches have been pursued to retroactively make document.createElement work in a namespaced document:
- create a null namespaced element. This is the approach of the DOM Level 3 Core recommendation.
- create an element in the namespace of the root element.
- create an element in the default namespace for the document (the namespace declared by the un-prefixed xmlns= attribute)
- create an element that is an XHTML namespace element if the document is flagged as an HTML document or if its media type when it was created indicated text/html (or maybe application/xhtml+xml too).
An alternative may be to move the createElement method to the DOMCore element interface (element.createElement()). This way the element would be created by a specific element with a namespace already defined (or even null as the case may be).
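The difference between the null-namespace approach and the root-element approach can be sketched with Python's xml.dom.minidom, whose createElement() happens to produce an element with no namespace. This is only an illustration on a non-browser DOM; the helper create_like_root is a hypothetical name, not part of any DOM recommendation.

```python
from xml.dom import minidom

XHTML_NS = "http://www.w3.org/1999/xhtml"


def create_like_root(doc, local_name):
    """Hypothetical helper: create an element in the root element's namespace."""
    return doc.createElementNS(doc.documentElement.namespaceURI, local_name)


doc = minidom.parseString('<html xmlns="%s"><body/></html>' % XHTML_NS)

plain = doc.createElement("p")                  # null namespace in minidom
explicit = doc.createElementNS(XHTML_NS, "p")   # namespace stated by the caller
inherited = create_like_root(doc, "p")          # namespace taken from the root

print(plain.namespaceURI)      # None
print(explicit.namespaceURI)   # http://www.w3.org/1999/xhtml
print(inherited.namespaceURI)  # http://www.w3.org/1999/xhtml
```

A script that must behave the same under every approach can avoid the question entirely by always calling createElementNS() with an explicit namespace.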
document.write and document.writeln
The other DOM method with difficulties in XML, and even in legacy text/html serializations, is the document.write method. In certain circles document.write is already deprecated. It may be worthwhile to consider deprecating it in HTML5 as well. However, are there ways to rescue this method, already familiar to authors, while making it safe to use and without breaking existing content? One solution may be to make it work through the setInnerHTML or setInnerXML methods so that it continues to work in XML de-serialized documents.
A big disadvantage of document.write is that it generally requires a complete reparse of a document. So making it work with setInnerHTML would make it work better even in the text/html serialization. The advantage of document.write is that it can be made to work from a script anywhere in the document and it will add the necessary serialized HTML in place. To make setInnerHTML work in a similar way it might be useful to add a DOM attribute to embedded scripts that exposes the script's enclosing script element introspectively to the script. In this way the script might invoke something like thisElement() and have the script's containing element returned to it. From there it can use setInnerHTML or any other methods by walking the DOM tree relative to its containing element.
With a thisElement() DOM attribute, the document.write method would be re-implemented to accomplish the same thing the author expects, but without re-parsing the whole document. This might make optimizations and threading techniques possible that would not otherwise be possible.
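The idea of writing in place relative to the script element, rather than re-parsing the document, can be sketched outside a browser. The following Python example uses xml.dom.minidom; the function write_in_place is hypothetical, and the script element passed to it stands in for whatever the proposed thisElement() would return. It assumes the markup is a well-formed XML fragment and ignores namespaces.

```python
from xml.dom import minidom


def write_in_place(script_element, markup):
    """Parse markup as an XML fragment and insert it right after script_element.

    Only the fragment is parsed; the host document is never re-parsed.
    """
    # A wrapper element lets the fragment have several top-level nodes.
    wrapper = minidom.parseString("<wrapper>" + markup + "</wrapper>").documentElement
    doc = script_element.ownerDocument
    parent = script_element.parentNode
    anchor = script_element.nextSibling  # None means "append at the end"
    for child in list(wrapper.childNodes):
        parent.insertBefore(doc.importNode(child, True), anchor)


doc = minidom.parseString('<body><script id="s"/><p>after</p></body>')
script = doc.getElementsByTagName("script")[0]  # stand-in for thisElement()
write_in_place(script, "<em>hi</em> there")

print(doc.documentElement.toxml())
# <body><script id="s"/><em>hi</em> there<p>after</p></body>
```

The new nodes land where document.write would have put its output, directly after the script, while the rest of the tree is left untouched.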
Since XML documents typically make use of XML namespaces, and text/html serialized documents follow no clear recommendation on the issue of namespaces, many serialization processing differences arise from the interpretation of the colon (":") used to separate a declared namespace prefix from the local name. For CSS, DOM methods, DOM attributes, and other processing in a non-namespace-aware document, a colon will typically be treated as just another character in the tag or attribute name, and no significance will be given to the portion before the colon. Of course non-namespace processing could smooth over this issue by applying analogous processing of prefixes even for non-namespace documents (e.g., allowing CSS selectors to apply to prefixes, treated as prefixes even though the document is not officially a namespace document). More investigation of existing implementations and their processing of non-namespaced documents is required.
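The two readings of the colon can be compared directly. In this Python sketch the standard library's html.parser stands in for non-namespace-aware processing (it is a simple tokenizer, not an HTML5 parser), and xml.etree.ElementTree stands in for namespace-aware processing; the class and function names are made up for illustration.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET


class TagCollector(HTMLParser):
    """Record start-tag names exactly as a non-namespace-aware parser sees them."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def names_without_namespaces(markup):
    collector = TagCollector()
    collector.feed(markup)
    collector.close()
    return collector.tags


def names_with_namespaces(markup):
    return [element.tag for element in ET.fromstring(markup).iter()]


MARKUP = (
    '<svg:svg xmlns:svg="http://www.w3.org/2000/svg">'
    "<svg:rect></svg:rect></svg:svg>"
)

print(names_without_namespaces(MARKUP))
# ['svg:svg', 'svg:rect']
print(names_with_namespaces(MARKUP))
# ['{http://www.w3.org/2000/svg}svg', '{http://www.w3.org/2000/svg}rect']
```

In the first result the prefix is simply part of the name, so a lookup for 'rect' finds nothing; in the second the prefix has been resolved to a namespace and discarded.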
- The pattern '/>' is not a parse error in the text serialization. While this editing diff to the draft indicates this is a move away from XML serialization compatibility, the pattern '/>' is not a parse error in XML either. The only pattern not allowed without escaping in XML CDATA sections is the sequence ']]>'. Perhaps the editor refers to something else in legacy SGML.
- CDATA sections are explicitly marked-up in XML serializations while in the text/html serialization the STYLE and SCRIPT elements are defined to contain CDATA. (should we recommend or require CDATA section handling in the text/html serialization or make no mention of it?)
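On the XML side, the need for an explicit CDATA section around script content can be checked with any XML parser. A minimal sketch in Python, using xml.etree.ElementTree; the script body and the helper parses_as_xml are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical script body containing both '<' and '&'.
SCRIPT_BODY = "if (a < b && c) { run(); }"


def parses_as_xml(markup):
    """Return True if markup is well-formed XML."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


bare = "<script>" + SCRIPT_BODY + "</script>"
wrapped = "<script><![CDATA[" + SCRIPT_BODY + "]]></script>"

print(parses_as_xml(bare))                         # False
print(parses_as_xml(wrapped))                      # True
print(ET.fromstring(wrapped).text == SCRIPT_BODY)  # True
```

Linking to an external script avoids the question in both serializations, since no script text then appears in the document.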
Document character set
The document character set changed from XML 1.0 to XML 1.1. XML 1.1 allows all Unicode characters other than the surrogates, the null character U+0000, and U+FFFE and U+FFFF. However, XML 1.1 also requires numeric character references for the C1 control characters (other than U+0085), whereas in XML 1.0 these characters could appear directly in the serialized document. The C0 control characters newly admitted by XML 1.1 must likewise be included only through character references.
The text/html serialization of HTML5 allows the C0 control characters, which XML 1.0 does not allow and XML 1.1 allows only as character references. XML 1.1 additionally requires character references for 32 further control characters in the range U+007F – U+009F (every character in that range except U+0085), whereas XML 1.0 did not require character references for these characters, and the HTML5 text/html serialization maps numeric character references to them onto other Unicode characters.
While these characters are permitted as part of the document conformance norms, their meaning is undefined by the HTML5 recommendation. They are not explicitly included as whitespace characters, though some of them do play that role in the HTML5 parsing algorithm and other places.
The following table lists several key ranges of Unicode characters that differ from one serialization to the next.
- 'X' indicates the character is not allowed in any way
- 'R' indicates the character is only allowed as a character reference
- 'N' indicates the character is only allowed as a literal and not also as a character reference
- 'A' indicates the character is allowed both as a literal and as a character reference
- 'W' indicates the character is subject to whitespace handling. When subject to whitespace handling, the character will be eliminated through whitespace normalization unless included as a character reference.
- *With the text/html serialization authors cannot include a carriage return U+000D either as a literal or as a character reference. Both methods of inclusion will be swapped for a line feed (U+000A) through the parsing process.
- †While document conformance requires these characters to be included through a character reference, the text/html parser will, at times, recover from this error and include the literal & or < characters.
|Characters||count||XML 1.0||XML 1.1||HTML5 text/html|
|C0 null character: U+0000||1||X||X||X|
|C0 control characters: U+0001 – U+0008||8||X||R||A|
|C0 tab: U+0009||1||A,W||A,W||A,W|
|C0 line feed: U+000A||1||A,W||A,W||A,W|
|C0 vertical tab: U+000B||1||X||R||A,W|
|C0 form feed: U+000C||1||X||R||A,W|
|C0 carriage return: U+000D||1||A,W||A,W||A,W*|
|C0 control characters: U+000E – U+001F||18||X||R||A|
|PCDATA Ampersand '&' U+0026||1||R||R||R†|
|PCDATA Less-than sign '<' U+003C||1||R||R||R†|
|CDATA Ampersand '&' U+0026||1||N||N||N|
|CDATA Less-than sign '<' U+003C||1||N||N||N|
|Attribute value Ampersand '&' U+0026||1||R||R||R†|
|Attribute value Less-than sign '<' U+003C||1||R||R||R†|
|C0 delete: U+007F||1||A||R||A|
|C1 control characters: U+0080 – U+0084||5||A||R||N|
|C1 control characters: U+0085||1||A||A,W||N|
|C1 control characters: U+0086 – U+009F||26||A||R||N|
To summarize the table:
- 28 C0 control characters (U+0001 – U+0008, U+000B, U+000C and U+000E – U+001F) are not allowed in the XML 1.0 serialization, but allowed in the other serializations
- 32 C1 control characters are allowed as:
  - References only in XML 1.1 (except U+0085, which may also appear as a literal)
  - Non-reference literal characters in HTML5's text/html serialization
  - Either as literals or references in XML 1.0
- To include a literal carriage return (U+000D) authors will need to use an XML serialization and include it as a character reference or insert it through the DOM
- Aside from the characters considered in the table, all other legal Unicode characters from U+0000 to U+10FFFF are permitted in all serializations. Note therefore that HTML5, in its DOM and all serializations, excludes ISO 10646 characters beyond U+10FFFF, which are permitted in encodings such as UCS-4.
- The 2,048 surrogate characters are also allowed in the text/html serialization; however, they are a parse error.
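The XML columns of the table can be cross-checked against the Char and RestrictedChar productions of the XML 1.0 and XML 1.1 recommendations. The following Python sketch encodes those productions directly; the function names are made up for illustration, and the HTML5 column is not modelled. The last helper uses xml.etree.ElementTree (an XML 1.0 parser) to show the carriage return behaviour on the XML side.

```python
import xml.etree.ElementTree as ET


def xml10_allows(cp):
    """True if the code point matches the XML 1.0 Char production."""
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def xml11_status(cp):
    """Return 'X' (not allowed), 'R' (reference only) or 'A' (literal or reference)."""
    in_char = (
        0x1 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )
    if not in_char:
        return "X"
    restricted = (
        0x1 <= cp <= 0x8
        or 0xB <= cp <= 0xC
        or 0xE <= cp <= 0x1F
        or 0x7F <= cp <= 0x84
        or 0x86 <= cp <= 0x9F
    )
    return "R" if restricted else "A"


def text_of(markup):
    """Text content of the root element after XML parsing."""
    return ET.fromstring(markup).text


print(xml10_allows(0x0B), xml11_status(0x0B))  # False R
print(xml10_allows(0x85), xml11_status(0x85))  # True A
print(repr(text_of("<a>x\r\ny</a>")))          # 'x\ny'  (literal CR LF is normalized)
print(repr(text_of("<a>x&#13;y</a>")))         # 'x\ry'  (the reference survives)
```

The last two lines show why an XML serialization needs a character reference to carry a carriage return into the DOM: the literal form is removed by line-end normalization before the application sees it.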
Summary of current serialization remedies
- Avoid CDATA sections by linking to scripts and stylesheets
- Avoid document.write()
- Avoid document.createElement() in XML in older browsers (check for XML serialization and use createElementNS() instead)
- Always include explicit 'tbody' and 'colgroup' elements.
- Always set the same background properties on 'html' and 'body' elements.
- Avoid Unicode C0 and C1 control characters except for: tab (U+0009) and line feed (U+000A).
- Parse errors
- Differences in DOM method return values
- Namespaces in text/html
- Quirksmode triggers and behaviors
- Parsing unknown elements: a thought experiment in graceful degradation