11064 – unstated requirement to be valid.

This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019.

Status:	RESOLVED NEEDSINFO

Alias:	None

Product:	HTML WG
Classification:	Unclassified
Component:	LC1 HTML/XHTML Compatibility Authoring Guide (ed: Eliot Graff)
Version:	unspecified
Hardware:	PC Windows NT

Importance:	P2 normal
Target Milestone:	---
Assignee:	Eliot Graff
QA Contact:	HTML WG Bugzilla archive list

Reported:	2010-10-15 09:02 UTC by David Carlisle
Modified:	2011-08-04 05:07 UTC
CC List:	6 users

Description David Carlisle 2010-10-15 09:02:15 UTC

There is an implicit requirement in the document that in order to comply with polyglot markup the document must be valid to some DTD. If the document is not valid (or not well formed) then the DOMs generated by the html and xml parsers would differ in many ways not listed here, due to the Adoption Agency algorithm and other special case handling done by the html parser.

On the other hand, if there were an explicit requirement in the specification that the document be conformant to some specified xml dtd matching html+mathml+svg then many parts of the document could be removed.

* most of section 4 would be redundant (eg the document is not well formed XML if <!DOCTYPE is lowercase)

* all of section 5 would be redundant given a suitable xml dtd

* 6.1 would be redundant if the xhtml dtd did not allow tr as a child of table and instead required tbody (this would be a change from xhtml1 dtd, but in line with other changes needed for html5)

* 6.2 and 6.3 could be removed

* 7.1 could be removed

* 8 could be removed if the dtd that was referenced at the start did not define the html/mathml entities, as then any use of such an entity would render the document not well formed.
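
The point about entities can be illustrated with a minimal Python sketch. It assumes the standard library's non-validating parser (xml.etree.ElementTree) stands in for "an XML parser"; the helper name is_well_formed and the sample documents are hypothetical and do not come from the polyglot draft.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text: str) -> bool:
    """Return True if a non-validating XML parser accepts the text."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# <!DOCTYPE html> references no DTD, so an HTML entity name such as
# &nbsp; is never declared on the XML side.
named = (
    '<!DOCTYPE html>'
    '<html xmlns="http://www.w3.org/1999/xhtml">'
    '<body><p>a&nbsp;b</p></body></html>'
)

# A numeric character reference needs no declaration.
numeric = named.replace("&nbsp;", "&#160;")

print("named entity well formed:", is_well_formed(named))
print("numeric reference well formed:", is_well_formed(numeric))
```

Under these assumptions the named entity is rejected as not well formed, while the numeric character reference parses.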



Given the constraints that the doctype should preferably be just 
<!DOCTYPE html>
some care in the wording is needed if there were a requirement that the document be valid to a particular dtd. However, it is easy enough to say this in words, or you could say that there is an implied xml catalog that substitutes the required dtd for whatever dtd is specified in the doctype declaration. I think the document would be a lot more accurate (and easier to keep in line with multiple versions of svg) if requirements such as the right case of element and attribute names were deferred to a machine-checkable xml dtd rather than kept in fragile, manually maintained lists in the text of a specification.

Comment 1 Henri Sivonen 2010-10-15 13:48:43 UTC

(In reply to comment #0)
> There is an implicit requirement in the document that in order to comply with
> polyglot markup the document must be valid to some DTD.

No, the requirement is for the document to be valid HTML5, but that doesn't involve a DTD.

> On the other hand, if there were an explicit requirement in the
> specification that the document be conformant to some specified xml dtd
> matching html+mathml+svg then many parts of the document could be removed.

I'd object to defining any HTML WG deliverable in terms of DTD-validity for reasons stated in http://lists.w3.org/Archives/Public/public-html/2007Apr/0799.html

Comment 2 David Carlisle 2010-10-15 14:15:02 UTC

(In reply to comment #1)

> No, the requirement is for the document to be valid HTML5,

well it doesn't say that (as clearly as I'd like) either.

>  but that doesn't involve a DTD.

No particular preference to DTD here I just wrote "DTD" out of habit.
A RelaxNG schema would do the job as well. Anything that can check the xml side of the equation mechanically.

> I'd object to defining any HTML WG deliverable in terms of DTD-validity for
> reasons stated in
> http://lists.w3.org/Archives/Public/public-html/2007Apr/0799.html

As I say, RelaxNG is probably better than DTD for this anyway. But for many of the requirements given as requirements for "polyglot" documents, even just a non validating xml parser would suffice. The requirement that <!DOCTYPE be uppercase (section 4) or that attribute values be surrounded by quotes (section 7) are just a requirement that the document be well formed XML.
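
A minimal Python sketch of that claim, assuming the standard library's non-validating parser (xml.etree.ElementTree) is representative; the function name xml_error and the sample documents are hypothetical illustrations, not text from the draft.

```python
import xml.etree.ElementTree as ET
from typing import Optional


def xml_error(text: str) -> Optional[str]:
    """Return the parser's error message, or None if the text is well formed."""
    try:
        ET.fromstring(text)
    except ET.ParseError as exc:
        return str(exc)
    return None


HEAD = '<html xmlns="http://www.w3.org/1999/xhtml"><body>'
TAIL = '</body></html>'

samples = {
    "uppercase DOCTYPE, quoted attribute":
        '<!DOCTYPE html>' + HEAD + '<p class="a">x</p>' + TAIL,
    "lowercase doctype":
        '<!doctype html>' + HEAD + '<p class="a">x</p>' + TAIL,
    "unquoted attribute value":
        '<!DOCTYPE html>' + HEAD + '<p class=a>x</p>' + TAIL,
}

for label, text in samples.items():
    # Only the first sample is expected to parse; the other two fail
    # on well-formedness alone, before any schema is consulted.
    print(label, "->", xml_error(text) or "well formed")
```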

It seems to me that the interesting part of this spec ought to be the _extra_ requirements (to get compatible DOM) that a document needs to have given that it is valid according to the html spec and (say) valid XML according to some specified RelaxNG schema.

Even things like the white space in <pre> rules in section 6.5 could be put in a Relax schema, which leaves (I think), more or less, just extra rules about where you can use /> syntax, special cases for <script>, and (if you use Relax rather than DTD) restrictions on the use of namespace declarations and prefixes that would not be visible at the RelaxNG layer.

Comment 3 Tab Atkins Jr. 2010-10-15 15:49:39 UTC

(In reply to comment #2)
> (In reply to comment #1)
> 
> > No, the requirement is for the document to be valid HTML5,
> 
> well it doesn't say that (as clearly as I'd like) either.

An HTML5 polyglot document is still an HTML5 document.  This specification does not purport to change conformance requirements; indeed, the specification is entirely non-normative.

Thus, the requirements imposed by the HTML5 spec still apply.

Comment 4 David Carlisle 2010-10-15 16:15:18 UTC

(In reply to comment #3)
> > well it doesn't say that (as clearly as I'd like) either.
> 
> An HTML5 polyglot document is still an HTML5 document.  This specification does
> not purport to change conformance requirements; indeed, the specification is
> entirely non-normative.
> 
> Thus, the requirements imposed by the HTML5 spec still apply.

yes but which requirements? HTML5 (as you know:-) specifies a parse tree for any input, not just valid input. This spec purports to say that if you follow the advice here then you will get the same DOM in text/html and xml, which clearly isn't the case for non valid input. It wouldn't hurt to say that there is an assumption that the input is valid.

Then, even if it is valid, it may make (say) a lot of use of implied end tags, and if it does that you won't get the same DOM (or any DOM at all) from an XML parser, but this spec does not tell you that. So there is an apparent assumption that this specification is discussing valid html5 documents that are (at least) well formed XML, and then giving the further requirements needed to get compatible DOM. But if the input is assumed well formed, why give requirements such as uppercase <!DOCTYPE or quotes around attribute values?

Comment 5 Henri Sivonen 2010-10-18 11:08:01 UTC

(In reply to comment #4)
> > Thus, the requirements imposed by the HTML5 spec still apply.
> 
> yes but which requirements?

All the document conformance requirements. That is, all the normative statements about document or authors.

> It wouldn't hurt to say that there
> is an assumption that the input is valid.

Agreed.

Comment 6 David Carlisle 2010-10-18 11:23:57 UTC

(In reply to comment #5)
> > yes but which requirements?
> 
> All the document conformance requirements. That is, all the normative
> statements about document or authors.

OK so if the polyglot spec aims to say, given a conformant html5 document, what further constraints the document must satisfy in order to get a compatible DOM if parsed as XML then

a) it needs to say that, and
b) it needs to add extra constraints to explain why

http://monet.nag.co.uk/~dpc/poly/t1.html

does not produce the same DOM from an html5 and XML parser. As far as I can see it is conformant html5 (validator.nu agrees) and it appears to meet all the constraints in this spec, but you don't get an XML DOM as t1.html isn't well formed as XML.

If the intention is that the constraints only apply to conformant html5 documents that are well formed XML then
a) the polyglot spec should say that, and
b) if it did say that, then  many of the constraints that are currently specified (such as quoting attribute values, and using uppercase <!DOCTYPE) are redundant.

Comment 7 Eliot Graff 2010-10-29 20:04:17 UTC

The spec currently says:

(from the Abstract) 
A document that uses polyglot markup is document that is a stream of bytes that parses into identical document trees (with the exception of the xmlns attribute on the root element) when processed as HTML and when processed as XML. Polyglot markup that meets a well defined set of constraints is interpreted as compatible, regardless of whether they are processed as HTML or as XHTML, per the HTML5 specification. 

(from the 2nd paragraph of SOTD)
This document summarizes design guidelines for authors who wish their XHTML or HTML documents to validate on either HTML or XML parsers, assuming the parsers to be HTML5-compliant. This specification is intended to be used by web authors. It is not a specification for user agents and creates no obligations on user agents. Note that this recommendation does not define how HTML5-conforming user agents should process HTML documents. Nor does it define the meaning of the Internet Media Type text/html. 


Please provide further info about how you'd like this changed.

I appreciate your feedback,

Eliot

Comment 8 David Carlisle 2010-10-29 21:25:55 UTC

(In reply to comment #7)

> 
> Please provide further info about how you'd like this changed.
> 
> I appreciate your feedback,
> 
> Eliot

I have given some explicit suggestions in earlier comments.

I believe that the document should say explicitly that it is documenting _extra_ constraints that must be satisfied by a document that is well formed XML and valid HTML5 in order that compatible DOMs are generated by an XML or HTML5 parser.

The document should then restrict itself to those extra constraints and remove all sections that only describe requirements of being xml well formed.

An alternative, as described in the original comment, would be to assume the document is both valid as html5 and valid to some xml schema, in which case the majority of the document could be removed.

(I could understand a "needsinfo" classification, but clearly the issue isn't resolved, so I re-opened.)

Comment 9 Eliot Graff 2010-10-30 00:41:57 UTC

> I believe that the document should say explicitly that it is documenting
> _extra_ constraints that must be satisfied by a document that is well formed
> XML and valid HTML5  in order that compatible DOM are generated by an XML or
> HTML5 parser.
> 
> then the document should restrict itself to those extra constraints and remove
> all sections that only describe requirements of being xml well formed.
> 
> An alternative, as described in the original comment would be to assume the
> document is both valid as html5 and valid to some xml schema, in which case the
> majority of the document could be removed.

I will work on language to indicate that polyglot involves the extra constraints of satisfying HTML and XML parsers.

> 
> (I could understand a "needsinfo" classification, but clearly the issue isn't
> resolved, so I re-opened.

I apologize. I meant just to toss the bug back to you for more info, but I could not get bugzilla to enable the MORE INFO field unless I also selected RESOLVED.

Thanks,

Eliot

Comment 10 Eliot Graff 2010-12-17 23:16:44 UTC

Hi David.

Reviewing this bug, I think I am finally getting that your desire is to have a DTD created that would encompass that subset of HTML5 (+mathml+svg) that intersects with valid XML and that is also compliant with the goal of polyglot markup: creating identical document trees (with the exception of the xmlns attribute on the root element) when processed as HTML and when processed as XML. However, I do not have the time or wherewithal to create such a DTD. This guide serves, in essence, as a description of the practices to adhere to and avoid that would be enforced in such a DTD. 

If we had a person or team who could create such a DTD, I think that would be fantastic. I would still want to keep the descriptions in this guide in addition to that resource, though. This guide enables authors to see--at a glance--what the rules are. They would then not have to parse the entire DTD to infer those things that are unique to polyglot markup.

If you are willing to undertake the creation of such a resource, I would welcome it, but, as I mentioned, I am not in a position to create it myself. I'll resolve this as NEEDS INFO, in case you wish to do the work to create the DTD and present it.

Thanks so much,

Eliot

Comment 11 David Carlisle 2010-12-17 23:33:26 UTC

(In reply to comment #10)
> Hi David.
> 
> Reviewing this bug, I think I am finally getting that your desire is to have a
> DTD created that would encompass that subset of HTML5 (+mathml+svg) that
> intersects with valid XML and that is also compliant with the goal of polyglot
> markup: creating identical document trees (with the exception of the xmlns
> attribute on the root element) when processed as HTML and when processed as
> XML. However, I do not have the time or wherewithal to create such a DTD. This
> guide serves, in essence, as a description of the practices to adhere to and
> avoid that would be enforced in such a DTD. 
> 

> If you are willing to undertake the creation of such a resource,

Comment 12 David Carlisle 2010-12-17 23:52:01 UTC

(In reply to comment #10)
> Hi David.
> 
> Reviewing this bug, I think I am finally getting that your desire is to have a
> DTD created that would encompass that subset of HTML5 (+mathml+svg) that
> intersects with valid XML and that is also compliant with the goal of polyglot
> markup:

Partly, but actually the problem is that because the scope of the document is unclear, it is very hard to review whether the constraints specified are sufficient to achieve the stated aim.

If the intention is to document the extra constraints that are needed for a valid html5 document to be a well formed xml document that parses to an equivalent dom then the specification does not give enough constraints.
For example it does not say that end tags must be explicit.
<p>aaa<p>bbb
is valid html5 but must be marked as
<p>aaa</p><p>bbb</p>
to be well formed XML.
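
A rough way to flag such omitted end tags, sketched in Python. This assumes the standard library's html.parser, which is only a tokenizer and not an HTML5 tree builder, so it is a heuristic and not a conformance check; the class and function names are hypothetical.

```python
from html.parser import HTMLParser


class OpenTagTracker(HTMLParser):
    """Tracks start tags that never get an explicit, matching end tag."""

    def __init__(self):
        super().__init__()
        self.open_tags = []    # stack of start tags still waiting for an end tag
        self.stray_ends = []   # end tags that did not match the top of the stack

    def handle_starttag(self, tag, attrs):
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # A self-closing tag such as <br/> arrives here as start + end.
        if self.open_tags and self.open_tags[-1] == tag:
            self.open_tags.pop()
        else:
            self.stray_ends.append(tag)


def unclosed_tags(fragment: str) -> list:
    """Return tag names that would leave the fragment unbalanced as XML."""
    tracker = OpenTagTracker()
    tracker.feed(fragment)
    tracker.close()
    return tracker.open_tags + tracker.stray_ends


print(unclosed_tags("<p>aaa<p>bbb"))
print(unclosed_tags("<p>aaa</p><p>bbb</p>"))
```

The first fragment leaves both p elements on the stack; the second leaves nothing.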

Conversely, if the intention is to document the extra constraints that a valid html5 document that is already well formed XML needs to satisfy so that the html and xml dom trees are compatible, then there are many redundant rules given in the spec, for example saying that attributes be quoted or that the doctype be uppercase.

I think that the document should be explicit which of these it is intending.
I would recommend the latter.


> 
> If we had a person or team who could create such a DTD, I think that would be
> fantastic. I would still want to keep the descriptions in this guide in
> addition to that resource, though. This guide enables authors to see--at a
> glance--what the rules are. 

I think that a mechanical representation of the rules (perhaps better in schematron or relaxng than dtd) would be a lot shorter than the document, so actually easier to see at a glance.

> They would then not have to parse the entire DTD to
> infer those things that are unique to polyglot markup.

I would only write a grammar that specified the extra constraints that a document known to be valid html5 and well formed xml needs to satisfy.
There is no need to duplicate the validation of the whole of html5.

> 
> If you are willing to undertake the creation of such a resource, 

I'm not willing to commit to providing one, but I might, I have some holiday coming up....

Comment 13 David Carlisle 2010-12-20 02:13:01 UTC

(In reply to comment #10)

> If you are willing to undertake the creation of such a resource, I would
> welcome it, but, as I mentioned, I am not in a position to create it myself.
> I'll resolve this as NEEDS INFO, in case you wish to do the work to create the
> DTD and present it.
> 

see

http://lists.w3.org/Archives/Public/public-html/2010Dec/0172.html

Comment 14 Henri Sivonen 2010-12-20 11:48:46 UTC

(In reply to comment #12)
> I think that a mechanical representation of the rules (perhaps better in
> schematron or relaxng than dtd) would be a lot shorter than the document, so
> actually easier to see at a glance.

I think this is not a good idea, because polyglotness includes lower-level constraints that can't be expressed in Schematron or RELAX NG. For example, the requirement not to have an XML declaration and the requirement to have one of the few doctypes can't be represented in Schematron or RELAX NG.
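
Both of these constraints sit in the document prolog, which a schema language working on the parsed element tree never sees, so a check has to look at the source text. A minimal Python sketch follows, assuming for illustration that the only acceptable doctype is an uppercase <!DOCTYPE html (the thread does not enumerate the permitted doctypes); the function name and messages are hypothetical.

```python
import re

XML_DECLARATION = re.compile(r"<\?xml\s")
# Assumption: only the short uppercase doctype is treated as acceptable here.
SHORT_DOCTYPE = re.compile(r"\s*<!DOCTYPE\s+html[\s>]")


def prolog_problems(source: str) -> list:
    """Return prolog-level problems found in the raw source text."""
    text = source.lstrip("\ufeff")  # ignore a leading byte order mark
    problems = []
    if XML_DECLARATION.match(text):
        problems.append("XML declaration present")
    if not SHORT_DOCTYPE.match(text):
        problems.append("does not start with an uppercase <!DOCTYPE html")
    return problems


print(prolog_problems('<!DOCTYPE html><html></html>'))
print(prolog_problems('<?xml version="1.0"?>\n<!DOCTYPE html><html></html>'))
print(prolog_problems('<!doctype html><html></html>'))
```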

Comment 15 David Carlisle 2010-12-20 12:15:36 UTC

(In reply to comment #14)
> (In reply to comment #12)
> > I think that a mechanical representation of the rules (perhaps better in
> > schematron or relaxng than dtd) would be a lot shorter than the document, so
> > actually easier to see at a glance.
> 
> I think this is not a good idea, because polyglotness includes lower-level
> constraints that can't be expressed in Schematron or RELAX NG. For example, the
> requirement not to have an XML declaration and the requirement to have one of
> the few doctypes can't be represented in Schematron or RELAX NG.

The posted schematron _only_ tries to check the constraints needed to ensure that a conforming html5 document that is well formed xml is polyglot.

If either of the things you mention are wrong, then the document isn't conforming html5, so fails the preconditions for using this. (Basically if the document hasn't already been cleared by validator.nu, it shouldn't be used with this schematron)

However there are a couple of things that you can't check as noted in my email to the list, and also checking that entities are not used requires using an xml parser that doesn't read the dtd, so it might yet be better to have some custom code that checks these things.  However I think having such a mechanical check helps to distinguish between those requirements that are needed for the document to be valid html5, those that are needed for the document to be well formed XML, and those that are needed for the document to be polyglot.

The schematron _only_ tries to check the last of these. The polyglot document currently has requirements of all three types, but doesn't state which category each requirement is in.
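
As an illustration of a check in that third category, here is a minimal Python sketch of one polyglot-only rule raised earlier in this bug (6.1: tr as a direct child of table). It is not the posted schematron; it assumes the input has already been accepted as conforming html5 and as well formed XML, and the function name is hypothetical. The rule matters because an html parser inserts an implied tbody where an XML parser does not.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"


def rows_directly_in_table(xml_text: str) -> int:
    """Count tr elements that are direct children of a table element."""
    root = ET.fromstring(xml_text)  # precondition: input is well formed XML
    count = 0
    for table in root.iter("{%s}table" % XHTML_NS):
        # findall with a bare tag name matches direct children only.
        count += len(table.findall("{%s}tr" % XHTML_NS))
    return count


without_tbody = (
    '<html xmlns="http://www.w3.org/1999/xhtml"><body>'
    '<table><tr><td>x</td></tr></table>'
    '</body></html>'
)
with_tbody = without_tbody.replace("<table>", "<table><tbody>").replace(
    "</table>", "</tbody></table>"
)

print(rows_directly_in_table(without_tbody))
print(rows_directly_in_table(with_tbody))
```

A non-zero count marks a document whose html and XML trees would differ around the table, even though it passes the other two categories of check.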

Comment 16 Michael[tm] Smith 2011-08-04 05:07:20 UTC

mass-move component to LC1

Comment 17 Michael[tm] Smith 2011-08-04 05:07:41 UTC

mass-move component to LC1