Rationale for “Polyglot Markup: A robust profile of the HTML5 vocabulary” on the Recommendation track
Summary: The specification Polyglot Markup: A robust profile of the HTML5 vocabulary (shortname HTML-POLYGLOT) should be considered for the Recommendation track.
New data since 2012
Information about developments in 2013.
- The editors have resolved all bugs, and taken note of most of the objections that have been raised against HTML-POLYGLOT. For instance:
- After input originating from Henri Sivonen and discussed in the TAG, a scope section has been added that clarifies when it is useful, and when it is not useful, to follow the HTML-POLYGLOT profile.
- After input from Lachlan Hunt and others, HTML-POLYGLOT now describes the rules for how to use CDATA in style elements. This allows tools to generate polyglot markup on the fly even in situations where placing (non-polyglot) content in external files is not an option. Thus, it has become simpler to produce polyglot markup.
- After input on public-html, the principles section is now normative.
- Also after input from public-html and Bugzilla, the principles and introduction better explain why HTML-POLYGLOT in some situations goes further than a pure DOM equivalence consideration requires. For example, the choice of UTF-8 as the sole encoding is explained. Another way to put this is that the weight on DOM equivalence has been balanced with more weight on robustness. This is reflected in a new subtitle (“A robust profile of the HTML5 vocabulary”).
- The spec has gotten some RFC 2119 language.
- A number of implementations, perfect and imperfect, have emerged or come to our attention. This is an incomplete list; please feel free to contact us to add to or update this information:
- WYSIWYG editor BlueGriffon has added a polyglot mode.
- Sam Ruby’s Wunderbar - see homepage
- Sam Ruby’s Planet Venus - see homepage
- The XHTML5 output of the HTML generator Freeway Pro defaults to producing polyglot markup (if encoding is set to UTF-8).
- At least two XML schemas for polyglot markup
- The CSE HTML Validator is reported to have a polyglot mode
- The HTML5 mode of XMLMind’s XML editor seems to follow polyglot markup. (That is: the HTML5 mode, not the XHTML5 mode.)
- The XHTML5 mode of the elxis cms seems to be polyglot.
- As an observation, and partly thanks to new additions to the HTML5 spec (e.g. see bug 23587), the W3C NU HTML validator has become, and will continue to become, simpler to use for polyglot validation. (But one should be willing to run a double validation: once for HTML5 and once for XHTML5.)
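As a rough local illustration of the CDATA rule for style elements and of the double check mentioned above, the following Python sketch feeds one document to an XML parser and to an HTML tokenizer from the standard library. The sample document, the comment-wrapped CDATA pattern, and the names `StyleText` and `strip_css_comments` are assumptions made for illustration and are not taken from HTML-POLYGLOT. Neither parser is a validator, and `html.parser` is not an HTML5-conforming tree builder, so this only hints at what a real double validation does.

```python
import re
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

XHTML_NS = "{http://www.w3.org/1999/xhtml}"

# Hypothetical sample document. The CDATA markers are hidden inside CSS
# comments so that the HTML side, which treats them as plain text, ignores them.
DOC = """<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en">
<head>
<meta charset="UTF-8" />
<title>Polyglot sample</title>
<style>/*<![CDATA[*/
p > em { color: red; }
/*]]>*/</style>
</head>
<body><p>Hello <em>world</em></p></body>
</html>"""


class StyleText(HTMLParser):
    """Collects the raw text found inside style elements (HTML view)."""

    def __init__(self):
        super().__init__()
        self.in_style = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "style":
            self.in_style = True

    def handle_endtag(self, tag):
        if tag == "style":
            self.in_style = False

    def handle_data(self, data):
        if self.in_style:
            self.chunks.append(data)


def strip_css_comments(css):
    """Remove /* ... */ comments and surrounding whitespace."""
    return re.sub(r"/\*.*?\*/", "", css, flags=re.S).strip()


# XML view: the CDATA section is resolved by the parser.
xml_style = ET.fromstring(DOC).find(f".//{XHTML_NS}style").text

# HTML view: style content is raw text, so the CDATA markers stay in place.
collector = StyleText()
collector.feed(DOC)
collector.close()
html_style = "".join(collector.chunks)

# Once comments are removed, both views should carry the same style rules.
same_css = strip_css_comments(xml_style) == strip_css_comments(html_style)
print(same_css)
```

The raw text differs between the two views (only the HTML view still contains the CDATA markers), but the CSS that remains after comment removal is the same, which is the property the comment-wrapping is meant to give.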
On the topic of making Polyglot Markup a recommendation:
- While we might not recommend that all authors (or in fact that many authors) should create Polyglot documents, we should Recommend that when authors want to create polyglot markup they do so by following the authoring requirements outlined in this specification. In fact, the introduction of HTML-POLYGLOT does state that
"All web content need not be authored in polyglot markup."
- HTML-POLYGLOT does include normative language that can be followed precisely and a document can be measured objectively against those requirements to see if it successfully adheres to the polyglot markup rules. It is possible to build a validator that checks a document to see if it is a valid polyglot document according to HTML-POLYGLOT and to identify the normative requirements that have been violated should the validation fail.
- There is precedent for guidance documents including authoring guidance to be published as Recommendations. For example:
- To publish as a Working Group Note would send the signal that “work has ended” on this particular topic (see comment by Mike). Such a signal would be wrong, for example because bugs continue to be filed.
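To make the claim about objective checking concrete, here is a minimal Python sketch of a checker that reports which rules a document violates. It covers only a small, illustrative subset (UTF-8 encoding, no XML declaration, presence of the `<!DOCTYPE html>` doctype, XML well-formedness). The function name `check_polyglot_subset` and the rule labels are hypothetical, and the sketch is in no way a complete or authoritative polyglot validator.

```python
import re
import xml.etree.ElementTree as ET


def check_polyglot_subset(raw: bytes) -> list:
    """Return labels of the (illustrative) rules that `raw` violates."""
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        # Nothing else can be checked reliably without a decodable text.
        return ["encoding-not-utf-8"]

    problems = []
    # An XML declaration is "<?xml" followed by whitespace at the very start.
    if re.match(r"<\?xml\s", text.lstrip("\ufeff").lstrip()):
        problems.append("xml-declaration-present")
    # The doctype is matched case-sensitively in this sketch.
    if "<!DOCTYPE html>" not in text:
        problems.append("doctype-missing")
    try:
        ET.fromstring(raw)
    except ET.ParseError:
        problems.append("not-well-formed")
    return problems


good = (b'<!DOCTYPE html>\n<html xmlns="http://www.w3.org/1999/xhtml">'
        b'<head><meta charset="UTF-8" /><title>t</title></head>'
        b'<body><p>ok</p></body></html>')
bad = (b'<?xml version="1.0" encoding="UTF-8"?>\n'
       b'<html xmlns="http://www.w3.org/1999/xhtml">'
       b'<head><title>t</title></head>'
       b'<body><p>unclosed<br></p></body></html>')

print(check_polyglot_subset(good))
print(check_polyglot_subset(bad))
```

Returning a list of violated rule labels, rather than a single pass/fail flag, mirrors the point that a failed validation should identify which requirements were not met.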
Why normative language
On the topic of using normative language in the specification:
- The purpose for using normative language in the specification is to make it clear which parts are necessary to conform to the specification and which parts provide advisory or informative content. HTML-POLYGLOT is intended to make it possible to objectively determine if a document adheres to its requirements and therefore it is appropriate to differentiate between normative and informative parts.
- It seems evident from an evaluation of Appendix C that C’s lack of normative language and exact rules had bad effects. A document that uses normative language has a greater chance of being coherent than one that does not. For instance, if one section of the document states a MUST-level requirement for a particular syntax, then the chance that another section contradicts it is quite small, simply because normative statements are likely to be checked for coherence. However, exactly that kind of inconsistency is found in Appendix C, probably because the section did not define anything normative. Normative language should and will often take the form of a list, akin to a recipe. XHTML 1.0’s section 5 made clear that Appendix C was informative only. Thus, strikingly, the informative status of Appendix C may have made it less informative. Hence, making sure the spec uses normative language seems like a relevant strategy as long as the goal is to produce a spec that is coherent and of practical value.
- Authors are going to use XHTML5 syntax for text/html, hence it is of benefit that there is a normative spec for how to do it. And there is no other specification that takes care of defining an HTML-compatible XHTML profile. For instance, Appendix C took the attitude that user agents had to change, rather than authors and authoring tools.
Why good value
On the topic of the value of promoting polyglot markup:
document.write, which is also not used by Polyglot Markup. Not only does this example show that there can be real value in a conservative spec, but it also shows that there is a market for such a spec, for which the HTML WG should offer real value. And, by the way, the effect of this does not need to be that XML gets more attention; it could just as well lead to attention to the secure subset of the
- One syntax, two serializations is a feature. A tool vendor could serve two user groups via one syntax. And this, in turn, has the potential of simplifying the tool for its users, as it ought to allow the vendor to skip prompting the user to make choices about character encoding or markup format.
- It keeps the XHTML simple. While Polyglot Markup also adds requirements (like <!DOCTYPE html>) to XHTML5, overall, the HTML-compatibility requirements keep XHTML on a short leash and keep it simple. E.g. it forbids encodings other than UTF-8, it forbids the XML declaration and so on. And best of all: this is not an artificial extra but an effect of HTML5’s design; Polyglot Markup is merely picking the fruit.
- It adds pedagogical value. While this alone may not make it worth sending Polyglot Markup for Recommendation, the single syntax highlights, on the one hand, how HTML5 itself is designed to be XML-compatible and often permits XML syntax within HTML documents. On the other hand, it can be educational regarding how XML and HTML differ with regard to the DOM, especially in light of Appendix C’s different approach to whether it is XML or HTML that should adapt.
- It places HTML5 and its syntaxes in a balanced perspective. The HTML5 process has given pure HTML syntax much attention and respect, and circumstantial evidence (e.g. consider the widely used HTML5 Boilerplate, of which there is no polyglot version today) tells us that pure HTML syntax has grown in popularity. The HTML5 spec also emphasizes XHTML5 as a serialization of its own, which is justified both in principle and as a reaction to history (Appendix C) and the future (XHTML is today well supported by all UAs of the vendors in the HTML working group). But this new attention to the different natures of HTML and XHTML, including the justified critique of e.g. XHTML (version 1.0) served as HTML, means that many authors are probably unaware of what HTML5 does with regard to XHTML5, and even more unaware that HTML5 contains enough data to define HTML-compatible XHTML5 profiles. Polyglot Markup thus sheds valuable light on this third way (or “the both ways”, as we could call it), helping potential users (authors and tool vendors) to discover that the HTML5 spec also contains this opportunity, and thereby also bringing a balanced perspective to the issue of the HTML5 vocabulary and its two syntaxes.
Why no C
If there is “little consensus” around polyglot then, on the flip side, this could prevent history from repeating itself. But more seriously, and though one cannot really predict perception, it is the background and goals of Polyglot Markup, compared with those of Appendix C, that are the real reasons why history won’t repeat.
- C-mantics vs real, new semantics. XHTML 1.0’s offer was an XML replica of HTML4. Thus, to the extent that XHTML 1.0 was perceived and touted as the latest and greatest version of HTML (note that the URL ends with the acronym “html”, not “xhtml”), the syntax naturally had to be the attraction. For HTML5, by contrast, it is the new vocabulary that is the attraction.
- C’s different context. The scope of Appendix C’s “parent spec”, XHTML 1.0, was very different from that of Polyglot Markup’s parent spec, which is one reason why Appendix C and Polyglot Markup end up with different answers to the same questions. Take the question of the character encoding, which Polyglot Markup limits to UTF-8 and restricts with regard to how it can be declared. Appendix C section 1 and section 9 could, together, have led to a conclusion similar to that of Polyglot Markup. But that did not happen. XHTML 1.0 also took the stance that “there is no requirement for XHTML 1.0 documents to be compatible with existing user agents”, whereas the HTML5 spec follows several strategies to keep HTML and XHTML in sync.
- C’s melting pot perspective on the format question. Subsection 5.1 on the media type tellingly talks about text/html before it talks about application/xhtml+xml. Also, XHTML 1.0 in several places speaks of HTML-compatibility as a legacy issue rather than a format issue: “backward compatible when serving XHTML documents as media type text/html”. The spirit very much seems to be that the two languages will, and should, melt together on the parser level as soon as parsers “adapt” to behaviors of the other serialization, or stop being “legacy browsers” by starting to ignore certain constructs from the parallel syntax. For Polyglot Markup, in contrast, the acceptance of the two formats is reflected in its title, “Polyglot Markup”, and is also rooted in HTML5 proper, whose subtitle is “A vocabulary and associated APIs for HTML and XHTML”.
- C was not conservative. Appendix C.1 warns against processing instructions, for HTML-compatibility, and yet C.14 recommends them, for XML-compatibility. The result is not a single robust and conservative syntax but instead a syntax whose requirements vary according to the choice of character encoding, stylesheet etc.
- C’s DOM from the future. Perhaps the most important expression of the robustness principle in Polyglot Markup is its identical-DOM principle, which in practice means that the XML has to adapt to HTML. (“Flagging implied tags” is considered useful even by Henri, but note that Henri is wrong in assuming that XHTML 1.0 Appendix C caused any such flagging, see below.) By contrast, by ruling that authors were not required to represent in markup all elements that HTML parsers auto-generate in the DOM, Appendix C took the opposite view: that authors do not have to achieve identical DOMs! See C.11: “Rather than require document authors to insert extraneous elements, XHTML has made the elements optional. User agents need to adapt to this accordingly”. Polyglot Markup instead requires authors to be conservative in what they send: it does not ask anything extra of the XML parser but requires that XML-compatibility is achieved by representing all elements in the markup. (For example, it does not rely on extending XHTML5 to handle named entities.)
- C’s low concern for conformance. Some syntax rules relate to well-formedness while others just make sense but have (in theory/principle) no effect on parsing. For example, it does not make sense to place <meta http-equiv="Content-Type" content="text/html;charset=UTF-8" /> inside an XHTML document. But using it in XHTML anyhow does not, in principle (read: in a conforming parser!), break anything other than the formal conformance rules themselves. Appendix C could thus be said to have used, in practice, the pure conformance rules of HTML as an extension opportunity. By contrast, Polyglot Markup (and HTML5 proper for XHTML5!) only allows <meta charset="UTF-8" />. This is because Polyglot Markup adheres to the conformance rules of HTML5 as well as those of XHTML5, and the resulting syntax is not only well-formed but also makes sense in both serializations. Appendix C was a profile of a specific XHTML syntax but not a profile of a specific HTML syntax, whereas Polyglot Markup is a profile of both the specific syntax of XHTML5 and the specific syntax of HTML5.
- C’s unrealized potential. One motivation of Polyglot Markup is that the strict rules indirectly promote best-practice coding styles for both XML and HTML: external scripts, external stylesheets, UTF-8, simplicity, no valid XML (only well-formed XML), only no-quirks mode, no magical elements (such as <noscript>) and so on. When we look at the extent to which Appendix C (for example in order to support encodings other than UTF-8) opens up for XML-specific code in HTML, and how it makes authors expect HTML-specific behavior in XML, one must conclude that these side effects were obviously not the motivation behind Appendix C.