Review on Polyglot Markup Draft from Lachlan Hunt on 2010-07-28 (public-html@w3.org from July 2010)

From: Lachlan Hunt <lachlan.hunt@lachy.id.au>
Date: Wed, 28 Jul 2010 16:18:37 +0200
To: public-html <public-html@w3.org>
Message-ID: <4C503C3D.6090409@lachy.id.au>
Hi,
   This is my review of the current Polyglot Markup draft.

My first problem is that it purports to be a normative document with 
normative requirements and references.  It should instead be an 
informative note describing the requirements that are derived from the 
intersection of HTML and XHTML requirements as defined in HTML5.  I hope 
the intention is for this draft to eventually be published as a WG note 
and is not on the Rec track.  (I was referred to bug 9969 in IRC for
this issue, and so I will document my rationale for this more fully
there later.)

*Character Encoding*

The draft states:

   "When polyglot markup uses UTF-16, it should include the BOM
    indicating UTF-16LE or UTF-16BE."

I realise that text is copied from an e-mail I wrote myself on the topic 
a while ago, but the description is slightly misleading with regard to
what UTF-16LE and UTF-16BE are, and should be rephrased.  I suggest it 
be rephrased like this:

   When polyglot markup uses UTF-16, the Byte Order Mark (BOM) must be
   included.  The BOM is used to indicate whether the encoding is
   big-endian or little-endian.

(You could also omit the second sentence from that, as it may not be 
necessary to provide that bit of trivia to readers.)
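
As an illustration of what the BOM conveys, here is a minimal Python
sketch. It is not part of the draft; the helper name utf16_byte_order is
hypothetical, and the sketch assumes the input is already known to be
UTF-16 (it does not try to rule out UTF-32, whose little-endian BOM
begins with the same two bytes).

```python
import codecs
from typing import Optional


def utf16_byte_order(data: bytes) -> Optional[str]:
    """Report the byte order signalled by a leading UTF-16 BOM, if any.

    Hypothetical helper for illustration only.
    """
    if data.startswith(codecs.BOM_UTF16_LE):  # bytes FF FE
        return "little-endian"
    if data.startswith(codecs.BOM_UTF16_BE):  # bytes FE FF
        return "big-endian"
    return None  # no BOM: the byte order is not indicated


doc = "<!DOCTYPE html><html><head><title>t</title></head><body/></html>"

# The "utf-16-le" and "utf-16-be" codecs do not emit a BOM themselves,
# so it is prepended explicitly here.
with_le_bom = codecs.BOM_UTF16_LE + doc.encode("utf-16-le")
with_be_bom = codecs.BOM_UTF16_BE + doc.encode("utf-16-be")
without_bom = doc.encode("utf-16-le")

print(utf16_byte_order(with_le_bom))
print(utf16_byte_order(with_be_bom))
print(utf16_byte_order(without_bom))
```

The third case is the one the suggested wording rules out: without the
BOM, nothing in the byte stream states the byte order.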

   "In addition, polyglot markup need not include the meta charset
    declaration, because the parser would have to read UTF-16 in order
    to parse it by definition."

This too should be updated to state that, at least per the current spec, 
inclusion of the meta charset declaring UTF-16 (or any other
non-ASCII-compatible encoding) is forbidden.

   "Use both the XML Declaration and meta tag to specify the appropriate
    character encoding."

This is wrong.  The XML declaration cannot be used.  This requirement 
contradicts the previous section in the draft where it is correctly 
noted that "Processing Instructions and the XML Declaration are both 
forbidden in polyglot markup."

Remove the incorrect advice from this section, and state that only UTF-8 
or UTF-16 may be used.  Technically you could also say that other 
encodings can be used if declared at the protocol level (Content-Type 
metadata), but such advice, if included, should be accompanied by a strong
warning to authors to avoid alternative encodings.


*The DOCTYPE*

I suggest you provide an example illustrating the about:legacy-compat 
DOCTYPE.

The list of rules for the DOCTYPE syntax should state that it must 
conform to the rules for XML DOCTYPEs.
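
The kind of example meant here could look like the following sketch,
which embeds a small document using the about:legacy-compat DOCTYPE and
feeds it to Python's standard XML parser to show that the declaration
is acceptable to an XML processor. The document content and the helper
name parses_as_xml are made up for illustration. The lowercase variant
at the end is valid in the HTML syntax but is expected to be rejected by
the XML parser, which is why the XML rules have to be followed.

```python
import xml.etree.ElementTree as ET

# Illustrative document; only the DOCTYPE line is the point of the example.
POLYGLOT_DOC = (
    '<!DOCTYPE html SYSTEM "about:legacy-compat">\n'
    '<html xmlns="http://www.w3.org/1999/xhtml">\n'
    '<head><title>Example</title></head>\n'
    '<body><p>Hello</p></body>\n'
    '</html>'
)


def parses_as_xml(markup: str) -> bool:
    """Hypothetical helper: True if the markup is well-formed XML."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


root = ET.fromstring(POLYGLOT_DOC)
root_tag = root.tag  # namespace-qualified name of the root element

# Same document with a lowercase "doctype" keyword, which XML does not allow.
lowercase_ok = parses_as_xml(POLYGLOT_DOC.replace("<!DOCTYPE", "<!doctype"))

print(root_tag)
print(lowercase_ok)
```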

   "Polyglot markup may use any other XHTML document type declaration
    with a referenced DTD,..."

This is incorrect.  The only XHTML DOCTYPEs permitted for use in
HTML5 content are those listed as obsolete but permitted.  These
include XHTML 1.0 Strict and XHTML 1.1.

The use of any other DOCTYPE is not permitted in polyglot HTML5, because 
no other XHTML DOCTYPEs are considered conforming in HTML5.  Such 
DOCTYPEs can be used in XHTML-only documents, where there are no 
restrictions on the permitted DOCTYPEs.  But such documents are not to 
be considered conforming polyglot documents.

   "However, note that by using a document type declaration that
    references a DTD, the document is required to follow the rules of
    the DTD. The rules of the DTD may or may not be compatible with
    polyglot markup."

That is not a requirement imposed by the HTML5 specification.  The point 
of permitting the limited set of obsolete DOCTYPEs is to assist with the 
transition period, so that new HTML5 features can be incorporated into 
existing pages, and still claim conformance with HTML5.  The 
requirements of their respective obsolete specs are not relevant to an 
HTML5 conformance claim.


*Namespaces*

   "... The prefix must be declared on an SVG or MathML element by using
    an attribute in the xlink namespace or on any of its SVG or MathML
    ancestors."

That statement does not make sense.  What does it mean to declare the 
prefix "by using an attribute in the xlink namespace"?  I believe the 
statement is just trying to state that the prefix must be declared 
before xlink:href can be used.
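
To make that reading concrete, here is a small Python sketch comparing
an SVG fragment that declares the xlink prefix with one that does not.
The fragments and the helper name is_well_formed are invented for
illustration; the sketch only shows how a namespace-aware XML parser
treats the two cases, not what the draft should say.

```python
import xml.etree.ElementTree as ET

XLINK_NS = "http://www.w3.org/1999/xlink"

# Prefix declared on the svg element, an ancestor of the element using it.
declared = (
    '<svg xmlns="http://www.w3.org/2000/svg" '
    'xmlns:xlink="http://www.w3.org/1999/xlink">'
    '<a xlink:href="#target"><text>link</text></a>'
    '</svg>'
)

# Same fragment without the xmlns:xlink declaration.
undeclared = (
    '<svg xmlns="http://www.w3.org/2000/svg">'
    '<a xlink:href="#target"><text>link</text></a>'
    '</svg>'
)


def is_well_formed(markup: str) -> bool:
    """Hypothetical helper: True if the namespace-aware parser accepts it."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


link = ET.fromstring(declared)[0]          # the <a> element
href = link.get("{%s}href" % XLINK_NS)     # attribute in the xlink namespace

print(href)
print(is_well_formed(undeclared))
```

In the second fragment the XML parser has no binding for the prefix and
refuses the document, so the declaration has to be in scope wherever
xlink:href appears.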


*Case Sensitivity*

Element Names:
   "Polyglot markup uses the correct case for element names."

Please refer to this as the "canonical case".  This applies to the
Attribute Names section too.

Attribute Values:

This section lists a set of attributes whose values are supposedly
case sensitive and required to be lowercase, which is not true.  The
list itself appears to be derived from the requirements for
case-insensitive attribute selector matching in the spec, as applied to
HTML elements in HTML documents.

In HTML5, that list is specifically written as user agent requirements 
for selector matching.  You cannot directly derive document authoring 
requirements from this list.  By attempting to do so, the draft imposes
requirements on authors that have no counterpart in the spec.

For the purpose of selector matching, attribute values in XML are all 
treated case sensitively (except where noted in the user agent style 
sheet). But for the purpose of deriving semantics, most of the listed
attributes are defined to have ASCII case-insensitive values.

The only exception is the type attribute on ol elements, which is always 
treated case sensitively, but this is not unique to either HTML or XHTML 
and the attribute is non-conforming anyway, and so it is not relevant 
for polyglot documents.

I recommend you modify the section to note the case sensitivity of all 
attribute values for the purpose of selector matching, and recommend but 
not require the use of lowercase values for all attributes with values
that are enumerated, MIME types, language tags, charsets, boolean,
media queries, or keywords.

These are the conforming attributes that have case-insensitive values:

* accept
* accept-charset
* charset
* checked
* defer
* dir
* direction
* disabled
* enctype
* hreflang
* http-equiv
* lang
* media
* method
* multiple
* readonly
* rel (for values that don't contain a colon)
* scope
* selected
* shape
* target (keywords only; browsing context names are case-sensitive)
* type on a, link, object, script, style
* type on input

All the rest of the attributes listed in this section of the current 
draft are non-conforming.
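
A recommendation of this kind (advice, not a conformance error) could be
checked along the lines of the following Python sketch. It is a rough
illustration only: the class name LowercaseValueAdvisor is hypothetical,
the attribute set is a subset of the list above, and the per-element
cases (type) and per-value cases (target keywords) are left out to keep
it short. Only the rel rule about colons is handled.

```python
from html.parser import HTMLParser

# Subset of the attributes listed above whose values are case-insensitive.
# "type" and "target" are omitted because they need extra context.
LOWERCASE_RECOMMENDED = {
    "accept-charset", "charset", "checked", "dir", "disabled", "enctype",
    "hreflang", "http-equiv", "lang", "media", "method", "multiple",
    "readonly", "scope", "selected", "shape",
}


class LowercaseValueAdvisor(HTMLParser):
    """Hypothetical checker that collects advice rather than errors."""

    def __init__(self):
        super().__init__()
        self.advice = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if value is None:  # attribute written without a value
                continue
            applies = name in LOWERCASE_RECOMMENDED or (
                name == "rel" and ":" not in value
            )
            if applies and value != value.lower():
                self.advice.append((tag, name, value))


advisor = LowercaseValueAdvisor()
advisor.feed(
    '<form method="POST" enctype="multipart/form-data">'
    '<input type="text" disabled="DISABLED" />'
    '<a rel="NoFollow" href="/Some/Path">x</a>'
    '</form>'
)
advisor.close()

for tag, name, value in advisor.advice:
    print("consider lowercase: <%s %s=%r>" % (tag, name, value))
```

The href value is deliberately left alone, since URLs are not in the
list of case-insensitive values.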


*Empty Elements*

The HTML5 specification refers to these as void elements in order to 
distinguish them from elements that happen to have no content.  Please 
refer to void elements instead of empty elements here too.

   "The alternative syntax <br></br> allowed by XML gives uncertain
    results in many existing user agents."

This document should not concern itself with the uncertainty of legacy 
browser behaviour.  If anything, it should instead note how HTML5 
requires </br> to be handled and state that its use is forbidden.
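
For reference, the HTML5 parsing algorithm treats a </br> end tag as a
parse error and then acts as if it were a <br> start tag, so <br></br>
produces two br elements in the HTML syntax but one in XML. A checker
for the void element syntax could be sketched in Python as below. The
class name VoidSyntaxChecker is hypothetical, and the element set is a
partial list assumed for illustration rather than copied from the spec.

```python
from html.parser import HTMLParser

# Partial list of void elements, assumed here for illustration.
VOID_ELEMENTS = {
    "area", "base", "br", "col", "embed", "hr", "img", "input",
    "link", "meta", "param", "source", "wbr",
}


class VoidSyntaxChecker(HTMLParser):
    """Hypothetical checker: void elements must be written as <name/>."""

    def __init__(self):
        super().__init__()
        self.problems = []

    def handle_startendtag(self, tag, attrs):
        # <br/> style: acceptable for void elements, nothing to report.
        pass

    def handle_starttag(self, tag, attrs):
        # Reached only for tags written without the trailing slash.
        if tag in VOID_ELEMENTS:
            self.problems.append("<%s> without trailing slash" % tag)

    def handle_endtag(self, tag):
        if tag in VOID_ELEMENTS:
            self.problems.append("</%s> end tag" % tag)


checker = VoidSyntaxChecker()
checker.feed("<p>a<br/>b<br></br>c</p>")
checker.close()

for problem in checker.problems:
    print(problem)
```

Only the <br/> form passes; both halves of <br></br> are reported.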

-- 
Lachlan Hunt - Opera Software
http://lachy.id.au/
http://www.opera.com/
Received on Wednesday, 28 July 2010 14:19:11 UTC