11909 – The principles of Polyglot Markup - validity? well-formed? DOM-equality?

This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019.

Summary: The principles of Polyglot Markup - validity? well-formed? DOM-equality?

Status:	CLOSED FIXED

Alias:	None

Product:	HTML WG
Classification:	Unclassified
Component:	LC1 HTML/XHTML Compatibility Authoring Guide (ed: Eliot Graff)
Version:	unspecified
Hardware:	PC All

Importance:	P2 major
Target Milestone:	---
Assignee:	Eliot Graff
QA Contact:	HTML WG Bugzilla archive list

URL:	http://dev.w3.org/html5/html-xhtml-au...
Whiteboard:
Keywords:

Depends on:
Blocks:	11910

Reported:	2011-01-28 13:15 UTC by Leif Halvard Silli
Modified:	2011-08-04 05:07 UTC
CC List:	7 users

See Also:

Attachments

Description Leif Halvard Silli 2011-01-28 13:15:04 UTC

PROPOSAL:

I suggest having a *normative* scope description of Polyglot Markup, and I am suggesting the following:

]] Polyglot Markup describes an HTML5-valid (validity), HTML5-compatible (well-formedness), XML-well-formed (well-formedness), DOM-equal (DOM equality) subset of HTML5. It does not, however, occupy itself with XML-validity. It is XML-compatible when necessary for well-formedness reasons, but always both HTML-valid and HTML-compatible. [[

This could go into the intro or in a new paragraph. It would be ideal to establish a vocabulary which could be used throughout the spec. Then one could say "To use <colgroup> is a DOM-equality issue". Or "<p/> cannot be used because it isn't HTML5-valid". Or "An @id cannot begin with a number for XML-validity reasons". (XML 1.0 has a similar section where it defines what terms such as well-formed and valid mean.)

CURRENT STATUS

Currently, the principles of Polyglot Markup can be gleaned from the Abstract ("identical document trees" etc), from the Introduction ("valuable to be able to serve HTML5 documents that are also well formed XML documents") and from the title of the spec ("HTML-compatible XHTML documents").

DISCUSSION

Regarding XML-validity: For example <div id="999"></div> is valid HTML5. But it is invalid (but well-formed) XML. If we (as I suggest) do *not* want it to be XML-valid, then this should be said. Maybe polyglots should strive to be XML-valid also? However, since the weight is on being HTML-compatible rather than XML-compatible, this is an argument in favour of ignoring XML-validity and instead putting the weight on HTML-compliance. But then we should be conscious about it and state it in the draft.

According to Henri Sivonen, the Polyglot spec should only describe a subset of XML1 and HTML5. We should only read the specs and pick what is compatible with both specs. But which subset?

* Validity subset: The HTML-valid subset? The XML-valid subset? The HTML + XML-valid subset?
* Well-formed subset?
* Well-formed and valid?
* DOM equal subset?
* All the above?

The two main problems in this list are: DOM equality (this is not described in a spec that we can look at) and XML-validity (should we care?). But also, to a degree, HTML-validity/-conformance. It seems like HTML-conformance/-validity should not count as being as important as HTML-compatibility.

PROBLEM EXAMPLES:

<colgroup>: The draft says that polyglot markup *requires* <colgroup/>, or else the XML DOM will be different from the HTML DOM. OK. But then we are outside both validity and well-formedness - then we are in "equality" land, which isn't described in any other standard that we could formulate a subset of. It is Polyglot Markup's task to describe the DOM equal subset.
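
As an illustration of the XML half of this difference, here is a minimal Python sketch, assuming the standard-library xml.etree.ElementTree parser as a stand-in for an XML parser. The sample table markup and the helper name child_tags are made up for the example. An XML parser builds exactly the elements that are written, whereas the HTML5 parsing algorithm wraps a bare <col> in an implied <colgroup>, so only the explicit form gives the same tree in both.

```python
import xml.etree.ElementTree as ET


def child_tags(xml_source):
    """Return the tag names of the root element's direct children."""
    return [child.tag for child in ET.fromstring(xml_source)]


# Hypothetical sample markup: a <col> without and with an explicit <colgroup>.
implicit = "<table><col/><tbody><tr><td>x</td></tr></tbody></table>"
explicit = (
    "<table><colgroup><col/></colgroup>"
    "<tbody><tr><td>x</td></tr></tbody></table>"
)

# The XML tree mirrors the source: no <colgroup> appears unless it is written.
print(child_tags(implicit))
print(child_tags(explicit))
```

The HTML side is not shown because Python's standard library has no HTML5 tree builder.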

<xmp> and <plaintext>: to discuss those elements inside Polyglot Markup shows an emphasis on equality, rather than validity (they are HTML5-invalid) or well-formedness (they have no XML-well-formedness problems). The only problem is that they work differently in HTML and XHTML.

attributes - line-feeds, tabs and CR inside attributes: this is neither a validity issue nor a well-formedness issue. It is purely a DOM equality issue, and only sometimes an important one.
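
The attribute case can be sketched in Python, assuming the standard-library xml.etree.ElementTree parser for the XML side and html.parser for the HTML side (the latter is only a tokenizer, which is enough here). The helper names and the sample markup are hypothetical. XML attribute-value normalization turns a literal tab or line feed into a space, while the same characters written as character references survive in both parsers.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser


class TitleGrabber(HTMLParser):
    """Record the title attribute of the first start tag seen."""

    def __init__(self):
        super().__init__()
        self.title = None

    def handle_starttag(self, tag, attrs):
        if self.title is None:
            self.title = dict(attrs).get("title")


def title_via_xml(source):
    return ET.fromstring(source).get("title")


def title_via_html(source):
    grabber = TitleGrabber()
    grabber.feed(source)
    grabber.close()
    return grabber.title


literal = '<p title="a\tb\nc"></p>'       # raw tab and line feed in the value
escaped = '<p title="a&#9;b&#10;c"></p>'  # same characters as character references

for sample in (literal, escaped):
    print(repr(title_via_xml(sample)), repr(title_via_html(sample)))
```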

@id: XML has some global validity rules for @id. For instance, an @id may not begin with a number. Should it matter to Polyglot Markup?

Comment 1 David Carlisle 2011-01-28 13:49:56 UTC

(In reply to comment #0)

> * Validity subset: The HTML-valid subset? The XML-valid subset? The HTML +
> XML-valid subset?
> * Well-formed subset?
> * Well-formed and valid?
> * DOM equal subset?
> * All the above?
> 
>

there is no dtd for html so xml validity can't be an issue.

I think the polyglot spec should assume the document is conforming html and well formed xml. Thus the only rules it needs to discuss are those aimed at making DOM equal (or equal enough (eg CDATA in script)) for html or xml parsing.

Comment 2 Leif Halvard Silli 2011-01-28 15:53:43 UTC

(In reply to comment #1)

> there is no dtd for html so xml validity can't be an issue.

Ok ... right. I misread what "all" refers to in this sentence of XML 1.0:

]] Validity constraint
  [Definition: A rule which applies to all valid XML documents. [[

I failed to follow the link on "valid", which leads to this:

]] [Definition: An XML document is valid if it has an associated document type declaration and if the document complies with the constraints expressed in it.] [[

Thus I was under the misunderstanding that e.g. <p id="666"/> would be invalid in "all" XML, even without a DTD. (What I must admit I find a bit strange is that the mere presence of a DTD, regardless of what the DTD says, would cause <p id="666"/> to be invalid ...)

Thus Polyglot Markup needs only to say that, as long as (or because/when) there is no DTD, then XML-validity is not an issue. But I would not mind if it also said what to remember when/if there *is* a DTD. After all, the goal is to have an equal experience also in that circumstance.

> I think the polyglot spec should assume the document is conforming html and
> well formed xml. Thus the only rules it needs to discuss are those aimed at
> making DOM equal (or equal enough (eg CDATA in script)) for html or xml
> parsing.

To say that Polyglot Markup describes a DOM-equal subset of conforming HTML and well-formed XML sounds like a good description of the principle(s). From that definition it should be easy to understand what "HTML-compatible XHTML" means. Maybe the spec should say that when it says "HTML-compatible" then it means "DOM-equal".

The spec could then explain that the rules for conforming HTML are found in HTML5. And also say that the DOM to which polyglot markup needs to adapt is also described in HTML5. But that the rules for well-formed XML are found in XML. The spec could then, as you say, go on to discuss the consequences of these rules and principles.

I hope that this can be dealt with more systematically in the spec.

Thus, the spec should not mention <xmp> and <plaintext>. Or, if it mentions them, then it should make clear that they are invalid in HTML5 and that they are impossible to include in a DOM equal polyglot. (Well, xmp is possible, as long as one ignores the purpose of it in the first place.)

Comment 3 David Carlisle 2011-01-28 16:05:31 UTC

(In reply to comment #2)
>(What I must admit I find a bit strange is that
> the mere presence of a DTD, regardless of what the DTD says, would cause <p
> id="666"/> to be invalid ...)

that would be strange, but is not the case, there are no special rules for attributes of name id in XML. If an attribute is declared of type ID (whatever its name) then some additional validity rules apply.
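
A minimal Python sketch of this distinction, assuming the standard-library xml.etree.ElementTree parser, which is non-validating and therefore checks well-formedness only. The ASCII_NAME pattern is a simplified, ASCII-only approximation of the XML 1.0 Name production, and both helper names are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Simplified, ASCII-only approximation of the XML 1.0 Name production.
ASCII_NAME = re.compile(r"[A-Za-z_:][A-Za-z0-9_:.\-]*\Z")


def is_well_formed(source):
    """True if a non-validating XML parser accepts the document."""
    try:
        ET.fromstring(source)
    except ET.ParseError:
        return False
    return True


def could_be_id_typed(value):
    """True if the value would satisfy the Name rule that applies to
    attributes a DTD declares as type ID."""
    return ASCII_NAME.match(value) is not None


# Without a DTD the attribute is plain character data: well-formed is enough.
print(is_well_formed('<p id="666"/>'))
# The Name rule only matters once some DTD declares the attribute as ID.
print(could_be_id_typed("666"), could_be_id_typed("n666"))
```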

> 
> Thus Polyglot Markup needs only to say that, as long as (or because/when) there
> is no DTD, then XML-validity is not an issue. But I would not mind if it also
> said what to remember when/if there *is* a DTD. After all, the goal is to have
> an equal experience also in that circumstance.
> 

> To say that Polyglot Markup describes a DOM-equal subset of conforming HTML and
> well-formed XML, sounds like a good description of the principle(s). From that
> definition it should be easy to understand what "HTML-compatible XHTML" means.
> May be the spec should say that when it says "HTML-compatible" then it means
> "DOM-equal".
> 
> The spec could then explain that the rules for conforming HTML are found in
> HTML5. And also say that the DOM to which polyglot markup needs to adapt, is
> also described in HTML5. But that the rules for well-formed XML are found in
> XML. The spec could then, as you say, go on to discuss the consequences of
> these rules and principles.
> 
> I hope that this can be dealt with more systematically in the spec.

agreed, there have been a succession of bug reports from me on various versions asking that this be clarified, it's better than it was but still not crystal clear I agree.

If it was made clear at the start that the document was well formed xml and conforming html, and that the additional rules were there to get compatible parse trees, then rules such as

Polyglot markup surrounds all attribute values with quotation marks. Polyglot markup surrounds attribute values by either single quotation marks or by double quotation marks. 

in section 7 could be removed, as this is implied by (but only a small part of) being well formed.
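
That the quoting rule already follows from well-formedness can be checked with a short Python sketch, assuming the standard-library xml.etree.ElementTree parser as the well-formedness checker; the helper name and sample strings are made up for illustration.

```python
import xml.etree.ElementTree as ET


def accepted_by_xml_parser(source):
    """True if the source parses as well-formed XML."""
    try:
        ET.fromstring(source)
    except ET.ParseError:
        return False
    return True


double_quoted = '<p class="note">text</p>'
single_quoted = "<p class='note'>text</p>"
unquoted = "<p class=note>text</p>"  # fine in HTML, not well-formed XML

for sample in (double_quoted, single_quoted, unquoted):
    print(sample, accepted_by_xml_parser(sample))
```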


> 
> Thus, the spec should not mention <xmp> and <plaintext>. Or, if it mentions
> them, then it should make clear that they are invalid in HTML5 and that they
> are impossible to include in a DOM equal polyglot. (Well, xmp is possible, as
> long as one ignores the purpose of it in the first place.)

agreed, these should go, since if they are there the document isn't conforming

david

Comment 4 Leif Halvard Silli 2011-01-28 17:01:54 UTC

(In reply to comment #3)
> (In reply to comment #2)
> >(What I must admit I find a bit strange is that
> > the mere presence of a DTD, regardless of what the DTD says, would cause <p
> > id="666"/> to be invalid ...)
> 
> that would be strange, but is not the case, there are no special rules for
> attributes of name id in XML. If an attribute is declared of type ID (whatever
> its name) then some additional validity rules apply.

Good point. :) Thanks. Maybe that very point should also be made in the document ... !

> > Thus Polyglot Markup needs only to say that, as long as (or because/when) there
> > is no DTD, then XML-validity is not an issue. But I would not mind if it also
> > said what to remember when/if there *is* a DTD. After all, the goal is to have
> > an equal experience also in that circumstance.
> > 
> 
> > To say that Polyglot Markup describes a DOM-equal subset of conforming HTML and
> > well-formed XML, sounds like a good description of the principle(s). From that
> > definition it should be easy to understand what "HTML-compatible XHTML" means.
> > May be the spec should say that when it says "HTML-compatible" then it means
> > "DOM-equal".
> > 
> > The spec could then explain that the rules for conforming HTML are found in
> > HTML5. And also say that the DOM to which polyglot markup needs to adapt, is
> > also described in HTML5. But that the rules for well-formed XML are found in
> > XML. The spec could then, as you say, go on to discuss the consequences of
> > these rules and principles.
> > 
> > I hope that this can be dealt with more systematically in the spec.
> 
> agreed, there have been a succession of bug reports from me on various versions
> asking that this be clarified, it's better than it was but still not crystal
> clear I agree.
> 
> If it was made clear at the start that the document was well formed xml and
> conforming html, the additional rules were to get compatible parse trees then
> rules such as
> 
> Polyglot markup surrounds all attribute values with quotation marks. Polyglot
> markup surrounds attribute values by either single quotation marks or by double
> quotation marks. 
> 
> in section 7 could be removed, as this is implied by (but only a small part of)
> being well formed.

Or, instead of being removed, it could be turned into/referred to as examples.

I do think it makes sense to show authors examples of what the principles mean. And I don't mind a complete list of examples ...

What I don't like is if the document is more like a list of things we do and things we don't do ... without a clear expression of the principles behind the list.

> > Thus, the spec should not mention <xmp> and <plaintext>. Or, if it mentions
> > them, then it should make clear that they are invalid in HTML5 and that they
> > are impossible to include in a DOM equal polyglot. (Well, xmp is possible, as
> > long as one ignores the purpose of it in the first place.)
> 
> agreed these should go as if they are there the document isn't conforming

If they could be turned into examples of something, then they could remain ... They are not worth mentioning "out of the blue". But if examples of an extreme degree of DOM-un-equality etc are needed ... they could fit in. In fact, they could serve as examples of both DOM-(un)equality and HTML5-invalidity.

Comment 5 Leif Halvard Silli 2011-01-29 12:50:31 UTC

Following the discussion with David, I would reformulate and expand my suggested principles section like so:

]] 
Section I: Principle and base rules

HTML-compatible XHTML documents are, syntactically, XML documents that are authored according to conditions set by the HTML DOM and scripted according to the limitations defined by XML, where the HTML parser is triggered to use the most XML-equivalent rendering mode (no-quirks mode) and the same CSS can be used in both XML-mode and HTML-mode. Thus HTML-compatibility means equivalence in the fields of DOM, CSS and scripting, irrespective of HTML-parsing or XML-parsing. Conformance (validity) of an HTML-compatible XHTML document is governed by the HTML-standard that the author has followed - this document exemplifies how to create HTML5-conforming polyglot markup.

The above leads to the following sentences about what HTML-compatible XHTML is:

Polyglot Markup

1) is about how to replicate HTML's automatic DOM in XML;
2) follows a subset of well-formed XML where,
    HTML-conformance notwithstanding, it is the resulting 
    HTML DOM  which defines the XML-syntax rules. 
3) is scripted according to the rules of XML (no document.write)
4) triggers no-quirks mode in HTML parsers since this is most
    equivalent to how XML mode renders, both with regard to
    DOM and CSS;
5) has some exceptions w.r.t. DOM-equivalence on attribute
    level due to some required XML namespace attributes.
6) rules out some HTML-elements because they are impossible
    to replicate in a XML parser;
7) results in the same encoding and the same language in both 
    HTML-mode and XML-mode.
8) is validated for conformance according to an applicable
    HTML-standard - the HTML-conformance rules impact
    the DOM exceptions w.r.t. what inequality
    is tolerable.
9) does not need to be XML-valid. XML-validity requires a
    DTD, but HTML (in particular HTML5) seeks to avoid DTDs
    as they have no effect in HTML-parsers. DTD-authoring advice.

<-- then I would outline those sentences/principles before, finally, describing HTML5-conforming polyglots: -->

== 1. Replicating HTML's automatic DOM in XML ==

Extra rules from HTML's point of view - but also from XML's point of view:
In HTML, it is permitted to drop lots of syntax - as it gets autocreated in the DOM. In XML there is no such automation, thus the code must be written explicitly. Thus one must use the "</p>", one must use <html>, <body>, <head>, <colgroup> etc. [Provide a list of the automated DOM-productions that HTML offers - this list can be updated as HTML6 is specced and so on.]

Extra rule from HTML's point of view: Attribute normalization belongs here. 

Links to relevant sections in XML1 and HTML5.

== 2. Subset of well-formed XML - governed by the resulting HTML-DOM  ==

Describe exceptions from XML's POV: when <foo/> can be used and when <foo></foo> must be used. Etc. Without mixing conformance into the issue.

Describe the (most important) extra rules from HTML's POV: escaping '<' and '&' etc.
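
The <foo/> versus <foo></foo> rule could be checked roughly as in the Python sketch below. It is a regex-based approximation, not a parser: it assumes attribute values contain no '<' or '>' characters. The VOID_ELEMENTS set is an assumed list that should be checked against the HTML5 draft in use, and the function name is hypothetical.

```python
import re

# Assumed list of HTML5 void elements; verify against the draft being targeted.
VOID_ELEMENTS = {
    "area", "base", "br", "col", "embed", "hr", "img",
    "input", "link", "meta", "param", "source", "track", "wbr",
}

# A start tag that ends in "/>" (rough pattern, see the caveat above).
SELF_CLOSING_TAG = re.compile(r"<([A-Za-z][A-Za-z0-9]*)(?:\s[^<>]*)?/>")


def misused_self_closing(source):
    """Return names of non-void elements written with the '/>' shorthand."""
    return [
        name
        for name in SELF_CLOSING_TAG.findall(source)
        if name.lower() not in VOID_ELEMENTS
    ]


sample = '<p/><br/><div class="a"/><img src="x.png" alt=""/>'
print(misused_self_closing(sample))
```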

== 3. Scripting ==

Document.write is forbidden - etc.

== 4. No-quirks mode ==

Only no-quirks doctypes are permitted, or else the page is rendered differently in HTML vs XML. For this reason a no-quirks triggering doctype is also required (except inside the @srcdoc attribute). Also, say that in some legacy HTML-parsers, <?xml version="1.0" ?> triggers quirks. The same also happens (in IE6, IE7, IE8) if there is a <!--comment--> before the DOCTYPE. No-quirks is an absolute requirement. If legacy user agents with such behavior are not an issue, then neither the XML declaration nor such comments are a problem (however, HTML-conformance rules may forbid them).
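
A minimal Python sketch of a check for the constructs named above, assuming regular expressions are good enough for the document prologue. It only recognizes the short <!DOCTYPE html> form, although other doctypes also trigger no-quirks mode, and the function name and report format are hypothetical.

```python
import re

XML_DECLARATION = re.compile(r"<\?xml\b[^>]*\?>")
COMMENT = re.compile(r"\s*<!--.*?-->", re.DOTALL)
HTML5_DOCTYPE = re.compile(r"\s*<!DOCTYPE\s+html\s*>", re.IGNORECASE)


def doctype_report(source):
    """Describe what precedes the DOCTYPE and whether the short form follows."""
    position = 0
    preceded_by = []

    match = XML_DECLARATION.match(source, position)
    if match:
        preceded_by.append("xml-declaration")
        position = match.end()

    # Any number of comments may sit between the declaration and the DOCTYPE.
    while True:
        match = COMMENT.match(source, position)
        if not match:
            break
        preceded_by.append("comment")
        position = match.end()

    has_doctype = HTML5_DOCTYPE.match(source, position) is not None
    return {"preceded_by": preceded_by, "html5_doctype": has_doctype}


print(doctype_report('<?xml version="1.0" ?><!-- note --><!DOCTYPE html><html></html>'))
```

A non-empty preceded_by list marks a document that, according to the comment above, may get quirks mode in some legacy parsers.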

== 5. Equality exceptions ==

xml:lang, xmlns etc are permitted even though they result in a different DOM. Justification: required by XML. Unless these differences were accepted, polyglots would not be possible.

== 6. Banning of some HTML elements ==

Some HTML elements can't be used in XML, e.g. noscript, plaintext, etc.

== 7.  Internationalization ==

Polyglot Markup needs both xml:lang and lang, or else we get a language difference. Polyglot Markup should use UTF-8, for such and such reasons: it can be detected by XML-parsers, HTML5-conformance permits it, and more. Polyglot Markup which isn't UTF-8 or UTF-16 could use <?xml version="1.0" encoding="ISO-8859-1" ?>, however this could lead to non-polyglotness (quirks-mode) in some legacy parsers as well as non-validity in HTML5 - if this is an issue, then - for non-UTF8/16 encodings - authors *must* use an external HTTP header to set the encoding. Polyglot Markup RECOMMENDS UTF-8.
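
The lang/xml:lang pairing can be sketched as a check in Python, assuming the document is already well-formed XML and using the standard-library xml.etree.ElementTree parser. The function name and the sample documents are hypothetical.

```python
import xml.etree.ElementTree as ET

# ElementTree reports xml:lang under the reserved XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def language_mismatches(source):
    """Return (tag, lang, xml:lang) for elements where the two disagree."""
    problems = []
    for element in ET.fromstring(source).iter():
        lang = element.get("lang")
        xml_lang = element.get(XML_LANG)
        if lang != xml_lang:  # also catches one attribute present without the other
            problems.append((element.tag, lang, xml_lang))
    return problems


paired = '<html lang="en" xml:lang="en"><body><p>hi</p></body></html>'
unpaired = '<html lang="en"><body><p xml:lang="nb" lang="no">hei</p></body></html>'

print(language_mismatches(paired))
print(language_mismatches(unpaired))
```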

== 8. Validation according to a HTML-standard ==

This specification does not say which HTML-standard to validate against, but defines general rules. However, HTML5 is the basis for our thinking. HTML5-validation is the only validation we are aware of which properly takes the DOM into account - other validation services, such as XHTML1.0 validation by W3C, are known for not taking the DOM into account. (That said, HTML5-validation follows many rules that are not at all related to the DOM.)

== 9. XML-validity ==

XML-validity is only an issue if the DOCTYPE contains a DTD. Some advice about how to, eventually, author a DTD - say that @id should be CDATA and so on. Say that @id in a polyglot is CDATA, and thus not subject to XML 1.0's name production.

Section II: HTML5-specific examples

<!-- Here most of what is already in the spec can be used. -->

[[

Comment 6 Leif Halvard Silli 2011-01-30 00:42:42 UTC

(In reply to comment #5)
> == 7.  Internationalization ==

> Polyglot Markup should use UTF-8, for such and such reasons:

(A)  I now believe that the exact, permitted encodings should be a conformance issue.
   That way we can solve the issue that several people - including Sam - seem to want to say that only UTF-8 should/must be used. (Plus, we solve _my_ problem: I think it would be stupid to say that polyglot markup, by definition, needs to be UTF-8. I'm fine with limiting HTML5-compatible documents to UTF-8, as long as the rule is founded in something solid.)

    Thus the principles section should only give general considerations - e.g. it can say that if you set the encoding with a meta element, then you must also set the encoding with the XML declaration, except when the encoding is UTF-8 (and UTF-16).

    The status of the HTML5 spec is that it permits <meta charset="UTF-8"/> inside the XHTML syntax only when the value of @charset is UTF-8. And it also forbids the use of the XML declaration. Thus, section 2, about HTML5-conformance, should demand UTF-8.

(B) Regarding the general rules: we need to consider that HTML5/HTML parsers have encoding detection algorithm(s). Polyglot Markup must be authored in such a way that HTML5's encoding detection algorithm doesn't run (or at least does not run further than the step where there is a <meta charset> element at the start of the doc). This rule needs to be in place in order to equalize both the DOMs and the general experience. If the algorithm runs longer than that, then, in HTML5, the page can be redrawn, the encoding can change during the actual parsing, and so on.

I would also like to add a 10th point to the principles:

== 10. Authoring equality ==

Polyglots should be possible to author using both HTML tools and XML tools. And authoring is, in this case, understood as working on a single file - not in a CMS but in a file system. 

The practical consequence of this is that if you use encodings other than UTF-8/UTF-16, and also if you don't use the BOM, then there *must* be an encoding declaration inside the document. (This, in turn, leads us to say that, for HTML5-conforming documents, only UTF-8 (and perhaps UTF-16 - must think) is permitted.)
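
A rough Python sketch of how a tool might look for such an in-file encoding declaration. It is not HTML5's encoding detection algorithm: it checks only for a BOM, an XML declaration and a <meta charset> in the first 1024 bytes, and it treats any FF FE or FE FF prefix as UTF-16. The function name and the returned tuples are hypothetical.

```python
import codecs
import re

XML_DECL_ENCODING = re.compile(r'<\?xml[^>]*?encoding=["\']([A-Za-z0-9._-]+)["\']')
META_CHARSET = re.compile(r'<meta\s+charset=["\']([A-Za-z0-9._-]+)["\']', re.IGNORECASE)


def in_file_encoding(data):
    """Return (mechanism, encoding) for the first declaration found, else None."""
    if data.startswith(codecs.BOM_UTF8):
        return ("bom", "utf-8")
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return ("bom", "utf-16")

    # The declarations of interest are ASCII, so a lossy decode is sufficient.
    head = data[:1024].decode("ascii", errors="replace")

    match = XML_DECL_ENCODING.match(head)
    if match:
        return ("xml-declaration", match.group(1).lower())

    match = META_CHARSET.search(head)
    if match:
        return ("meta", match.group(1).lower())
    return None


print(in_file_encoding(b'<?xml version="1.0" encoding="ISO-8859-1" ?><html/>'))
print(in_file_encoding(b"<!DOCTYPE html><html></html>"))
```

A result of None corresponds to the case described above, where a file-system author has nothing inside the file that tells either kind of tool which encoding to use.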

Comment 7 David Carlisle 2011-01-30 18:16:28 UTC

> I prefer to have further debates about the scope of the document in bug 11909.

moving here as requested (but actually the mailing list is probably better)

> It is also not enough that the document is conforming HTML(5): <noscript> does
> not work as intended in XML. 

of course, but the extra constraints needed for a document known to be conforming html and well formed xml to get compatible DOM parse are sufficiently coherent that one could contemplate listing them in full in a spec, hence this spec.

you seem to want to document the set of well formed documents that produce a compatible DOM if parsed as XML and that is a vastly more complicated set to describe.


> Isn't it a good start to just list them?

No, it would be at best misleading.

> Just create a header saying "the
> following features are autogenerated if you don't insert them, and must
> therefore be explicitly added for DOM compatibility". And then list the
> features/elements. 

That's nowhere near close to a usable specification. The HTML parser doesn't just insert elements it moves things around in ways that are fully specified but that you can not specify here without duplicating much of the html5 parsing spec.

You'd have to specify all the ways in which p (and other) elements are auto-closed, all the ways in which form elements get moved around tables, all the html elements that force-close math and svg. You'd have to specify not to use image. The list is endless, and unless it was complete the end result would be that authors would be able to generate documents that complied with all the constraints in the polyglot spec, but which were not parsed in compatible ways by xml and html parsers.

If you restrict to conforming documents the complete set of constraints is more or less listed in the current document (there may be some bugs here and there but nothing that would make the document ten times bigger). If you do not restrict to conforming documents the downside is that the spec becomes unwritable and the only possible upside is that people are informed how to make non conforming documents using xml tools, but I don't see that making non conforming documents should be a valid use case.

Comment 8 Leif Halvard Silli 2011-01-30 20:33:31 UTC

(In reply to comment #7)

> you seem to want to document the set of well formed documents that produce a
> compatible DOM if parsed as XML and that is a vastly more complicated set to
> describe.

This bug is about the principles for polyglotness - and thus the principles for the polyglot spec. I think the principles should be documented. I have proposed 10 principles in Comment 5 and Comment 6. Some of those principles are ones we clearly agree about. Others are new, I think.

Do you agree with the principles? If yes, should we place them in the doc?

Comprehensive principles would be able to cover "the set of well formed documents that produce a compatible DOM if parsed as XML". So therefore you are against having/listing principles? 

I think principles a) can make the document clearer {that is: more _writeable_ as opposed to _unwritable_}, b) can help us make some decisions about what the document should say, c) can help authors get a 'polyglot mindset'.

> > Isn't it a good start to just list them?
> 
> No, it would be at best misleading.

There is a tendency in some of the things you have said to simply reject having any examples at all - for fear that the list does not become complete. E.g. you have not filed any bugs against the documentation of the requirement to escape tabs, linefeeds and carriage returns in attributes. But you immediately thought it to be a bad idea to put in the document that & and < need to be escaped.

I think it is the task of this document to be much more complete than XHTML 1.0 Appendix C became.

A proven method for discerning between requirements and examples is to use the phrases "this is normative" and "this is not normative". For example:

This is normative: Polyglots are XML well-formed. 
This is not normative: Thus, for example, & and < must be escaped - except inside CDATA sections. See XML 1.0 for complete definition of well-formed.
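
The non-normative example can be made concrete with a small Python sketch, assuming the standard-library xml.etree.ElementTree parser as the well-formedness checker and xml.sax.saxutils.escape for the escaping. The helper name and the sample text are made up.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def parses_as_xml(source):
    """True if the source is accepted as well-formed XML."""
    try:
        ET.fromstring(source)
    except ET.ParseError:
        return False
    return True


raw = "AT&T says 1 < 2"

unescaped_doc = "<p>" + raw + "</p>"
escaped_doc = "<p>" + escape(raw) + "</p>"     # & becomes &amp;, < becomes &lt;
cdata_doc = "<p><![CDATA[" + raw + "]]></p>"   # the CDATA exception

for doc in (unescaped_doc, escaped_doc, cdata_doc):
    print(doc, parses_as_xml(doc))
```

Note that a CDATA section is well-formed XML but is not treated the same way by an HTML parser, so it passes this check without being polyglot in general.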

> > Just create a header saying "the
> > following features are autogenerated if you don't insert them, and must
> > therefore be explicitly added for DOM compatibility". And then list the
> > features/elements. 
> 
> That's nowhere near close to a usable specification.

I don't understand why it is not a useful list. But anyway, the most important thing for myself is to document the principles. If a list is too much, then don't have it. That said, the polyglot spec is currently full of lists.

> The HTML parser doesn't
> just insert elements it moves things around in ways that are fully specified
> but that you can not specify here without duplicating much of the html5 parsing
> spec.

I don't see that giving such a list does mean that you have to specify all that.  However, the most important thing for me is to list the principles, rather than particular lists of elements etc.

Of course, HTML5 itself contains many lists and categories. So e.g. if we want to say something general about block elements, then it is OK for me to point to HTML5 for a list of them.

> You'd have to specify all the ways in which p (and other) elements are
> auto-closed, all the ways in which form elements get moved around tables, all
> the html elements that force-close math and svg. You'd have to specify not to
> use image. The list is endless, and unless it was complete the end result would
> be that authors would be able to generate documents that complied with all the
> constraints in the polyglot spec, but which were not parsed in compatible ways
> by xml and html parsers.

I understand that your view is that if we say A, then we must list the entire alphabet. It might also be that you are right - that the task I had in mind is too complicated.

Again, the most important thing for myself is to document the principles, rather than lists.

> If you restrict to conforming documents the complete set of constraints is more
> or less listed in the current document (there may be some bugs here and there
> but nothing that would make the document ten times bigger). If you do not
> restrict to conforming documents the downside is that the spec becomes
> unwritable and the only possible upside is that people are informed how to make
> non conforming documents using xml tools, but I don't see that making non
> conforming documents should be a valid use case.

One problem that we must deal with is the fact that what a conforming HTML document is, is a moving target. I will also remind you that HTML5 has the concept of "applicable specification" which can e.g. add other elements and namespace prefixes. Or do you really, honestly, want that only documents that conform to HTML5 "proper", can ever earn the right to be called polyglot? 

It is also the case that authors would want to use polyglot markup in order to achieve the benefits of doing so. Thus it is not *only* about meeting some formal requirement. So, for example, if a document gets quirks-mode parsing in IE6-9 because the author inserted <!--comments--> before <!DOCTYPE html>, then that author has got a practical problem.

So far the document doesn't speak about quirks mode. But I suggest, as one of the principles for polyglot markup, that it leads to no-quirks mode. Not to cover any set of documents. But to document and incorporate no-quirks mode into the polyglot concept.

Comment 9 Leif Halvard Silli 2011-02-01 06:13:40 UTC

Following off-list discussion with David, I would suggest the HTML-conformance requirement to go like this:

]]
	Polyglot mark-up conforms to a polyglot (as understood by 
	the rest of these principles [DOM-compatible, no-quirks etc]) subset
	of one of the following:

	1)	the text/html syntax of the HTML5 spec;
	2)	the text/html syntax of the HTML5 spec + applicable spec(s);[1]
	3)	XML schema-based subsets of HTML5 (example: XHTML1.0); [2] 

	NOTE [2]: XML schema-based supersets of, or additions to, the text/html
			  syntax or subsets of the text/html HTML5 syntax fall
			  under 2).
	NOTE [1]: the polyglot requirements must be met: "/>" must be 
			  limited to void elements per the HTML5 text/html syntax and
			  all polyglot requirements are followed;
[[

Those principles would, 

	for 1), place all non-conforming HTML5 features in the cold;
			Goodbye to <xmp>, <plaintext>, on conformance grounds;
			Goodbye to <noscript> on polyglotness grounds (lack
			of DOM-compatibility);
	for 2), same as 1) but would allow whatever the applicable spec
			permits, as long as it is "polyglot" (e.g. DOM-compatible);
	for 3), same as 1) except that some features forbidden by HTML5 would
			be permitted - e.g. the @compact attribute (and other attributes
			that HTML5 forbids but which XHTML1 may permit) while some features
			that HTML5 permits would be forbidden, such as the video element.

Comment 10 Leif Halvard Silli 2011-02-01 06:16:12 UTC

I would suggest the DOCTYPE principle to be expressed roughly like this:

]] Enumerating the DOCTYPE use options:
* All DOCTYPEs that the HTML5 text/html syntax rules consider
  conforming (<!DOCTYPE html SYSTEM "about:legacy-compat"> and
  <!DOCTYPE html> and, inside @srcdoc, no DOCTYPE.)
* Any _XHTML_ DOCTYPEs that HTML5 text/html syntax defines as 
  obsolete but conforming;
* Any _XHTML_ DOCTYPE that, following HTML5's text/html parsing 
  rules, results in no-quirks mode. [[

All the above doctypes are considered polyglot.

Comment 11 Eliot Graff 2011-02-12 00:49:25 UTC

Based on this discussion, I have added the following principles to the Introduction section as of the 11 February 2011 Editor's Draft:

]]
Polyglot markup: 
is valid HTML5. [HTML5]
is well-formed XML. [XML10]
results in identical DOMs (with the exception of the xmlns attribute on the root element) when processed as HTML and when processed as XML.

Polyglot markup is not constrained: 
to be valid XML. [XML10]
by conformance to any XML DTD.
[[
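
As a rough illustration of the third principle, the Python sketch below compares the start tags seen by an HTML tokenizer with the elements seen by an XML parser, ignoring xmlns, xmlns:* and xml:* attributes. It assumes the standard-library html.parser and xml.etree.ElementTree modules. It is not a real DOM comparison: html.parser does not implement HTML5 tree construction, and text nodes are ignored, so it can only catch some differences. All names in it are hypothetical.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser


class StartTagCollector(HTMLParser):
    """Collect (tag, attributes) for every start tag, in document order."""

    def __init__(self):
        super().__init__()
        self.start_tags = []

    def handle_starttag(self, tag, attrs):
        # Drop xmlns, xmlns:* and xml:* attributes, the tolerated exceptions.
        kept = {name: value for name, value in attrs if not name.startswith("xml")}
        self.start_tags.append((tag, kept))


def html_start_tags(source):
    collector = StartTagCollector()
    collector.feed(source)
    collector.close()
    return collector.start_tags


def xml_start_tags(source):
    result = []
    for element in ET.fromstring(source).iter():
        local_name = element.tag.rsplit("}", 1)[-1]  # strip "{namespace}"
        # Namespaced attributes such as xml:lang appear as "{uri}lang" here.
        kept = {k: v for k, v in element.attrib.items() if not k.startswith("{")}
        result.append((local_name, kept))
    return result


polyglot = (
    '<!DOCTYPE html><html xmlns="http://www.w3.org/1999/xhtml" '
    'lang="en" xml:lang="en"><head><title>t</title></head>'
    '<body><p class="x">hi<br/></p></body></html>'
)
print(html_start_tags(polyglot) == xml_start_tags(polyglot))
```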

Thank you so very much. This has been extraordinarily helpful in defining scope for polyglot markup.

Best,

Eliot

Comment 12 Leif Halvard Silli 2011-02-13 19:47:23 UTC

I am quite satisfied - this should grasp the essence of the debate. Well done!

But, there are 2 nitpicking points to be made:

1) You wrote 'HMTL' instead of 'HTML': "strong desire to serve both HMTL and XML tool chains"

2) You write '(with the exception of the xmlns attribute on the root element)'
     Could you rephrase that parenthesis to express that xmlns is not only permitted on the root element? As you know, it is also required for SVG, MathML etc.
     Maybe the very simplest change would be to say "a root element" instead of "the root element". However, some will perhaps think that "root" = <html>

Comment 13 Leif Halvard Silli 2011-02-14 11:32:00 UTC

(In reply to comment #12)

> 2) You write '(with the exception of the xmlns attribute on the root element)'
>      Could you rephrase that parenthesis to express that xmlns is not only
> permitted on the root element? As you know, it is also required for SVG, MathML
> etc.
>      Maybe the simplest change would be to say "a root element" instead
> of "the root element". However, some will perhaps think that "root" = <html>

However, there are some additional exceptions, such as 'xml:lang'. Perhaps you could state the exception as a more general principle, and also outside the parentheses?

Thus, I hereby suggest that you replace:
        ]]* results in identical DOMs (with the exception of the xmlns attribute on the root element) when processed as HTML and when processed as XML[[

With something like the following:
      ]]* results in identical DOMs, with the exception of some XML (xml:lang, xml:space and xml:base), XMLNS (xmlns="" and xmlns:xlink="") and XLINK (such as xlink:href) attributes which XML requires, which HTML5 permits in certain locations, and which are nevertheless preserved by HTML parsers[[
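
To make the attribute exception concrete, here is a small Python sketch. It assumes that the standard library's `html.parser` (a tokenizer, not the HTML5 tree builder) and `xml.etree.ElementTree` are good enough stand-ins to show how the same attributes surface on each side. `AttributeCollector`, `html_view` and `xml_view` are hypothetical names.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

MARKUP = '<p xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en">Hi</p>'


class AttributeCollector(HTMLParser):
    """Record the attributes of the last start tag seen."""

    def __init__(self):
        super().__init__()
        self.attrs = {}

    def handle_starttag(self, tag, attrs):
        self.attrs = dict(attrs)


def html_view(markup: str) -> dict:
    """Attributes as an HTML-style tokenizer reports them (plain names)."""
    collector = AttributeCollector()
    collector.feed(markup)
    collector.close()
    return collector.attrs


def xml_view(markup: str) -> dict:
    """Attributes as a namespace-aware XML parser reports them."""
    return dict(ET.fromstring(markup).attrib)
```

On the HTML side, `xmlns` and `xml:lang` show up as ordinary attributes with those literal names. On the XML side, `xmlns` is consumed as a namespace declaration and `xml:lang` is reported in the XML namespace, so the two attribute sets are not identical even though nothing is lost.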

Comment 14 Leif Halvard Silli 2011-02-14 12:27:30 UTC

(In reply to comment #11)

> ]]
> Polyglot markup: 
> is valid HTML5. [HTML5]
> is well-formed XML. [XML10]

I suggest replacing "is valid HTML5. [HTML5]" with:
    ]] is _a_ valid _HTML_ _document_. [HTML5] [[

Because:

a) HTML only exists as "documents" - required elements that you don't type are generated in the DOM.
b) saying "HTML document" rather than "HTML5 document" takes a slightly wider perspective. Also, HTML5 defines _HTML_. The link - [HTML5] - defines what the document means by "HTML".

Likewise, I also suggest replacing the "is well-formed XML" sentence with the following (i.e. I suggest adding the word 'document'):
   ]] is _a_ well-formed XML _document_. [XML10] [[

Because

a) to make it congruent with my "is a valid HTML document" proposal
b) the XML 1.0 link actually leads to the section in XML 1.0 that bears the title "2.1 Well-Formed XML Documents"

Comment: In a way, the DOM-equality goal is a result of the requirements to - at the same time - be a "valid HTML document" and a "well-formed XML document".

Comment 15 Leif Halvard Silli 2011-02-14 12:56:59 UTC

(In reply to comment #11)

Another suggestion: could you exemplify some of the consequences of the "identical DOMs" principle? In a way, I think you already exemplify some of the consequences in the paragraph which starts with "All web content need not be authored in polyglot markup." But you could add more.

Looking at the 10 principles I formulated in Comment 5 and Comment 6, several of them could be described as consequences of the "identical DOMs" principle. And my proposal is that you mention some excerpts of those principles as examples of what the 'identical DOM' principle leads to:

3) scripted according to the rules of XML (no document.write)
4) triggers non-quirks mode in HTML parsers - since quirks doesn't exist in XML;
6) rules out HTML elements impossible to replicate in XML
     // As well as REQUIRES some elements which HTML allows you to skip. //
7) results in same encoding and same language in HTML and XML
10) Authoring equality/"file URL parsing equality" = only UTF-8 and UTF-16 are RECOMMENDED (= SHOULD in RFC language).

You could also add that some, but not all, of these consequences are taken care of if the document is valid HTML5.

The reason why I think the above selection of 'identical DOMs' consequences should be mentioned is that it includes both obvious and non-obvious consequences.

Comment 16 Eliot Graff 2011-03-03 21:27:56 UTC

The part of the Introduction that contains the principles of polyglot now reads as follows in the 3 March Editor's Draft:

]]
Polyglot markup results in: 
 a valid HTML document. [HTML5]
 a well-formed XML document. [XML10] 
 identical DOMs when processed as HTML and when processed as XML. A notable exception to this is that HTML and XML parsers generate different DOMs for some xml (xml:lang, xml:space, and xml:base), xmlns (xmlns="" and xmlns:xlink=""), and xlink (such as xlink:href) attributes. XML requires and HTML5 permits these attributes in certain locations and the attributes are preserved by HTML parsers.

Polyglot markup is not constrained: 
 to be valid XML. [XML10] 
 by conformance to any XML DTD.

Polyglot markup is scripted according to the rules of XML (does not use document.write, for example) and excludes HTML elements that are impossible to replicate in an XML parser (does not use the <noscript> element, for example). Polyglot markup triggers non-quirks mode in HTML parsers, as non-quirks mode is closest to XML-mode rendering, in regard to both DOM and CSS. Polyglot markup results in the same encoding and the same language in both HTML-mode and XML-mode. 
[[
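
Two of the examples in the last paragraph (no document.write, no <noscript>) can be checked mechanically. The following Python sketch is only an illustration under stated assumptions: the document is parsed with the standard library's `xml.etree.ElementTree`, scripts are inline, and `polyglot_warnings` and `XHTML_NS` are hypothetical names, not part of any polyglot tooling.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"


def polyglot_warnings(markup: str) -> list[str]:
    """Return warnings for a few constructs the principles above exclude."""
    try:
        root = ET.fromstring(markup)
    except ET.ParseError as err:
        # Not well-formed XML, so none of the other checks can run.
        return [f"not well-formed XML: {err}"]

    warnings = []
    for element in root.iter():
        if element.tag == f"{{{XHTML_NS}}}noscript":
            warnings.append("uses <noscript>")
        is_script = element.tag == f"{{{XHTML_NS}}}script"
        if is_script and "document.write" in (element.text or ""):
            warnings.append("script calls document.write")
    return warnings


SAMPLE = (
    '<html xmlns="http://www.w3.org/1999/xhtml"><head><title>t</title>'
    '<script>document.write("x")</script></head>'
    '<body><noscript><p>n</p></noscript></body></html>'
)
```

The sketch does not cover quirks mode, encoding or language. Those would need an HTML5 parser and the HTTP or file context, which are outside what the standard library offers.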

I think that this covers the requests below. Thanks, once more, for your help!

Eliot

Comment 17 Leif Halvard Silli 2011-03-03 23:33:37 UTC

Very good. Especially happy that you include the thing about non-quirks mode! I'm closing this issue!

Comment 18 Michael[tm] Smith 2011-08-04 05:06:51 UTC

mass-move component to LC1

Comment 19 Michael[tm] Smith 2011-08-04 05:07:14 UTC

mass-move component to LC1