From HTML WG Wiki
The root element
Proposed redraft of "The root element" subsection
The following emails include a redraft of the subsection to:
- clarify the usage of xmlns and xmlns: prefixed elements
- add requirements for authors and authoring UAs to include @dir and @xml:lang on the root element
- modify the content model and other norms when HTML is used in compound XML documents
Concerns were raised about the value of the xmlns attribute in text/html serialized documents:
- xmlns must be set to xhtml URI
- xmlns only there due to errant authoring tools
- xmlns value stricter in current draft
In short, these messages suggest that the document conformance criteria should require the xmlns value "http://www.w3.org/1999/xhtml" in the text/html serialization.
This draft does not mention the @charset attribute discussed below. The suggestion below is only to add UA conformance criteria: by changing the pre-parsing algorithm, UAs would look for a @charset value on the root element. Since this is not backwards compatible, it is not yet recommended for document conformance criteria, only for UA conformance criteria.
charset attribute and encoding
This review section relates to the "root element" subsection of the draft and also to the separate "Document metadata" and "The input stream" subsections.
In general, I think it would be best to encourage authors to use encodings that can be detected from a BOM (UTF-8, UTF-16 and UTF-32) and to rely on the BOM alone, particularly if existing UAs handle encoding detection from the BOM adequately.
If we do change the author guidance on this, we should consider adding a charset attribute to the root element rather than adding a charset attribute to the <meta> element. This will be easier for authors to use. It will also be easier for UAs to pre-parse. (Rob Burns)
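To make the BOM-only idea concrete, here is a minimal Python sketch of detecting an encoding from the leading bytes alone. The function name `sniff_bom` and the returned labels are illustrative assumptions, not taken from the draft; the BOM byte values come from Python's standard `codecs` module.

```python
import codecs

# Longer BOMs are listed first: the UTF-32 LE BOM (FF FE 00 00)
# begins with the same two bytes as the UTF-16 LE BOM (FF FE).
BOMS = [
    (codecs.BOM_UTF32_BE, "UTF-32BE"),
    (codecs.BOM_UTF32_LE, "UTF-32LE"),
    (codecs.BOM_UTF8, "UTF-8"),
    (codecs.BOM_UTF16_BE, "UTF-16BE"),
    (codecs.BOM_UTF16_LE, "UTF-16LE"),
]


def sniff_bom(data: bytes):
    """Return an encoding label if `data` starts with a known BOM, else None.

    Hypothetical helper for illustration; not the draft's algorithm.
    """
    for bom, label in BOMS:
        if data.startswith(bom):
            return label
    return None
```

A document without a BOM yields `None` here, which is the case where a UA would have to fall back on some other declaration or on heuristics.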
Adding the following paragraph to the encoding detection algorithm could enable this in HTML5 conforming UAs:
"A sequence of bytes starting with: 0x3C, 0x68 or 0x48, 0x54 or 0x74, 0x4D or 0x6D, 0x4C or 0x6C, and finally one of 0x09, 0x0A, 0x0B, 0x0C, 0x0D, 0x20 (case-insensitive ASCII '<html' followed by a space)"
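The byte pattern in the proposed paragraph can be sketched as a small matcher. This is only an illustration in Python: the names `HTML_PREFIX` and `matches_html_start_tag` are hypothetical, and how a match would feed into the rest of the pre-parsing algorithm is not specified here.

```python
# Each entry is the set of byte values allowed at that position,
# copied from the proposed paragraph.
WHITESPACE = {0x09, 0x0A, 0x0B, 0x0C, 0x0D, 0x20}
HTML_PREFIX = (
    {0x3C},        # '<'
    {0x68, 0x48},  # 'h' or 'H'
    {0x54, 0x74},  # 'T' or 't'
    {0x4D, 0x6D},  # 'M' or 'm'
    {0x4C, 0x6C},  # 'L' or 'l'
    WHITESPACE,    # tab, LF, VT, FF, CR or space
)


def matches_html_start_tag(data: bytes, pos: int = 0) -> bool:
    """Return True if the bytes at `pos` match the proposed pattern."""
    window = data[pos:pos + len(HTML_PREFIX)]
    if len(window) < len(HTML_PREFIX):
        return False
    # Iterating over bytes yields integers, so each byte can be
    # tested directly against the allowed set for its position.
    return all(byte in allowed for byte, allowed in zip(window, HTML_PREFIX))
```

Note that a bare `<html>` does not match, because 0x3E ('>') is not in the trailing whitespace set; the pattern only fires when attributes may follow the tag name.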
More testing is required for: 1) current UA encoding detection from the BOM only; 2) encoding detection with meta@charset; 3) encoding detection from html@charset. In particular, tests should use non-UTF encodings, since UAs may be most savvy about detecting these encodings. Using, for example, ISO-8859-9 as the actual encoding and then mis-declaring another encoding (e.g., windows-1252) will provide the most robust test of current UA behavior.
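A test page of the mis-declared kind described above could be generated along these lines. This is a sketch under stated assumptions: the helper name `build_misdeclared_page`, its parameters, and the markup are hypothetical, and the `charset` attribute on `<html>` is the proposal under discussion, not an existing feature.

```python
def build_misdeclared_page(text, actual="iso-8859-9",
                           declared="windows-1252", where="meta"):
    """Return page bytes encoded as `actual` but labelled as `declared`.

    `where` selects the declaration under test: "meta" for meta@charset,
    "html" for the proposed html@charset. Hypothetical test helper.
    """
    if where == "meta":
        markup = (
            f'<html><head><meta charset="{declared}"><title>test</title>'
            f"</head><body><p>{text}</p></body></html>"
        )
    elif where == "html":
        markup = (
            f'<html charset="{declared}"><head><title>test</title>'
            f"</head><body><p>{text}</p></body></html>"
        )
    else:
        raise ValueError("where must be 'meta' or 'html'")
    # The bytes use the actual encoding, whatever the markup claims.
    return markup.encode(actual)


# Turkish letters whose ISO-8859-9 bytes mean something else in windows-1252.
page = build_misdeclared_page("ığş", where="html")
```

Decoding `page` as windows-1252 turns `ığş` into `ýðþ`, so a UA that trusts the declaration renders visibly different text from one that detects the real encoding.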