Bugzilla – Bug 17710
Polyglot Markup: Remove XML validity completely
Last modified: 2013-04-08 21:39:38 UTC
Section 4. The DOCTYPE requires that, quote:
]] * The string html is in lowercase letters. [[
And for a number of reasons, this is a good rule to have. However:
* To require lowercase 'html' is an XML validity constraint:
* Polyglot Markup, however, currently only operates with
an XML well-formed principle - it has no XML validity principle:
* To do <!DOCTYPE HTML> or <!DOCTYPE hTmL> (as opposed to
<!DOCTYPE html>) is NOT a well-formedness violation - it is thus
NOT a fatal error in XML.
* The HTML5 validator already accepts uppercase 'HTML' for documents
served as application/xhtml+xml: http://goo.gl/hbZvC
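The well-formedness claim above can be checked with any non-validating XML parser. The following is a minimal sketch using Python's standard library; the helper name is_well_formed and the sample document in BODY are illustrative assumptions, not part of the Polyglot Markup spec.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document body, used only for illustration.
BODY = (
    '<html xmlns="http://www.w3.org/1999/xhtml">'
    "<head><title>t</title></head><body/></html>"
)


def is_well_formed(markup: str) -> bool:
    """Return True if a non-validating XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


# The casing of the doctype name is not a well-formedness matter,
# so each of these variants is expected to parse without a fatal error.
for doctype in ("<!DOCTYPE html>", "<!DOCTYPE HTML>", "<!DOCTYPE hTmL>"):
    print(doctype, is_well_formed(doctype + BODY))
```

Because the parser used here does not validate, this only shows the absence of a fatal error; it says nothing about validity.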
Add a new principle in the Introduction section stating that, in addition to the other constraints (XML well-formedness and HTML-compatibility), Polyglot Markup complies with all the XML validity constraints of the DOCTYPE and the DTD of the document. This should be a MUST principle, as such a thing would favour the use of the HTML5 doctype, due to its much simpler XML validity requirements. Though I can also live very nicely with a SHOULD principle.
Benefits of this proposal:
(1) It favors the use of the HTML5 DOCTYPE, since the HTML5 doctype does not add any validity constraints - except the constraint that the DOCTYPE itself must contain the 'html' string in lowercase. (Polyglot Markup does not rule out other doctypes than the HTML5 doctype.)
(2) It makes Polyglot Markup a more universal specification, that applies even to - for example - XHTML 1.0 documents (which are considered 'obsolete but conforming' by HTML5.)
(3) It makes the spec more logical. After all, we cannot ignore the fact that merely having a DOCTYPE, even one as simple as the HTML5 doctype, DOES introduce the concept of XML validity into HTML5.
(4) Someone using an XML toolchain to create polyglot HTML can more easily understand how the concept of XML validity plays into Polyglot Markup. E.g. for a document with the HTML5 doctype, an XML validity check would only potentially produce a single error (wrong casing of the 'html' string in the DOCTYPE). On the other hand, many of the XML validity concepts that relate to XHTML 1.0 and XHTML 1.1 are relevant for HTML5 too (for instance, the requirement that @id attributes must be unique).
If we do not introduce the XML validity constraint, we need to take one
of the following two actions instead:
ALT 1: Turn the requirement to use lowercase 'html' into an
informational note about how to cater for validating XML processors:
"To cater for validating XML processors, the string
html should be in lowercase."
ALT 2: Delete the entire requirement that 'html' has to be lowercase.
Leave it all to XML and HTML5.
(In reply to comment #0)
> Section 4. The DOCTYPE requires that, quote:
> ]] * The string html is in lowercase letters. [[
> Benefits of this proposal:
> (1) It favors the use of the HTML5 DOCTYPE, since the HTML5 doctype does
> not add any validity constraints - except the constraint that the DOCTYPE
> itself must contain the 'html' string in lowercase.
The above is not technically correct. If you use
<!DOCTYPE html>
then the document is necessarily invalid according to the XML definition of validity. It is not accurate to say that this DOCTYPE adds no constraints; it is more accurate to say that it declares no elements, so any use of any element is invalid.
This is why it is important only to depend on XML well-formedness and not XML validity when defining polyglot documents.
You are correct, however, that the requirement that "html" be in lowercase is not consistent with the stated aims, as it is neither required for the document to be XML well-formed nor for it to be conforming HTML5. I suggest this requirement be dropped (your ALT 2).
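The point that a bare DOCTYPE declares no elements can be illustrated without a full validating parser. The sketch below is an assumption-laden approximation: it uses the expat bindings in Python's standard library to collect element declarations from the internal DTD subset and compares them with the elements actually used. The function name undeclared_elements is hypothetical, external DTD subsets are not read, and this is not a complete validity check.

```python
from xml.parsers import expat


def undeclared_elements(markup: str) -> set:
    """Return names of elements used in the document but not declared
    by an <!ELEMENT ...> declaration in the internal DTD subset."""
    declared, used = set(), set()
    parser = expat.ParserCreate()
    # Called once per <!ELEMENT name model> declaration.
    parser.ElementDeclHandler = lambda name, model: declared.add(name)
    # Called once per start tag in the document.
    parser.StartElementHandler = lambda name, attrs: used.add(name)
    parser.Parse(markup, True)
    return used - declared


# With the bare HTML5 doctype nothing is declared, so every element
# that appears in the document is reported as undeclared.
print(undeclared_elements("<!DOCTYPE html><html><body/></html>"))
```

Under this approximation, a document only stops reporting undeclared elements once the DOCTYPE actually carries declarations, which is the situation the comment above describes.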
(In reply to comment #1)
> <!DOCTYPE html>
[ ... ]
> It is not accurate to say that this DOCTYPE adds no constraints, it
> is more accurate to say that it declares no elements, so any use of
> any element is invalid.
He, he ... good point. Yeah, in that case there should be no point in focusing on the casing of the 'html' string inside <!DOCTYPE html> since, if the validating XML parser does not whine about the DOCTYPE, it will - anyhow - whine about undeclared elements ...
> This is why it is important only to depend on XML well-formedness and not XML
> validity when defining polyglot documents.
Yeah, it shows that the DOCTYPE is only for HTML compatibility - and not for XML compatibility …
> You are correct however that the requirement that "html" be in lowercase is on
> consistent with the stated aims as it is neither required for the document to
> be XML well formed nor for it to be conforming HTML5. I suggest this
> requirement be dropped. (your ALT 2)
Yes, that seems like the right conclusion then. I changed the title of the bug, accordingly.
Btw, are you 100% sure that there is no trouble for someone with an XML toolchain if the casing of the 'html' string of the DOCTYPE is 'wrong'?
(In reply to comment #2)
> (In reply to comment #1)
> Btw, are you 100% that there is no trouble for someone with an XML toolchain if
> the casing of the 'html' string of the DOCTYPE is 'wrong'?
The resulting file is well formed but not valid. Whether that causes problems depends on what you are doing. But most polyglot documents are likely to be well formed but not valid.
(In reply to comment #3)
I suggest that the editor just keeps the text as is, but adds, in a parenthesis - or via some other method - that the casing does not impact on the validity or well-formedness:
]] The string html is in lowercase letters. (RECOMMENDED, but not REQUIRED.) [[
Justification: I think it is good to say that the casing does not matter - this removes one more 'XML-is-so-difficult' misconception. But on the other hand, it also complicates things to send the message that "you can do it in 100 different ways". Additionally - and unlike for text/html - it is a well-formedness constraint for XML that the string DOCTYPE (and, when used, the strings SYSTEM and PUBLIC) is uppercase. Hence, it is important not to create the impression that one can do <!doctype html>.
(In other words: <!DOCTYPE HTML> would be well-formed but not valid, while <!doctype html> would not be well-formed.)
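The distinction drawn above can be sketched with the expat bindings in Python's standard library. The function name check is hypothetical, and the name comparison covers only one XML validity constraint (that the doctype name matches the root element type); it is not a substitute for a validating parser.

```python
from xml.parsers import expat


def check(markup: str):
    """Return (well_formed, doctype_name_matches_root).

    The second value is None when the markup is not well-formed.
    """
    seen = {}
    parser = expat.ParserCreate()

    def on_doctype(name, system_id, public_id, has_internal_subset):
        seen["doctype"] = name

    def on_start(name, attrs):
        # Only the first start tag, i.e. the root element, is recorded.
        seen.setdefault("root", name)

    parser.StartDoctypeDeclHandler = on_doctype
    parser.StartElementHandler = on_start
    try:
        parser.Parse(markup, True)
    except expat.ExpatError:
        return (False, None)
    return (True, seen.get("doctype") == seen.get("root"))


for doc in ("<!DOCTYPE html><html/>",
            "<!DOCTYPE HTML><html/>",
            "<!doctype html><html/>"):
    print(doc, check(doc))
```

In this sketch the uppercase doctype name parses but fails the name comparison, while the lowercase doctype keyword is rejected by the parser before any comparison takes place.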
Removed the requirement for lowercase "html" and added the following note in Section 4:
The string html SHOULD be in lowercase letters, in order to be both well-formed and valid XML; however, the string MAY be in mixed case or uppercase letters and still be well-formed XML.
new revision: 1.94; previous revision: 1.93
Resolving as fixed.
EDITOR'S RESPONSE: This is an Editor's Response to your comment. If
you are satisfied with this response, please change the state of
this bug to CLOSED. If you have additional information and would
like the Editor to reconsider, please reopen this bug. If you would
like to escalate the issue to the full HTML Working Group, please
add the TrackerRequest keyword to this bug, and suggest title and
text for the Tracker Issue; or you may create a Tracker Issue
yourself, if you are able to do so. For more details, see this
Change Description: See comment 5
(In reply to comment #5)
Hi Eliot. If there is no DTD, even lowercase fails to be valid. But we have not ruled out a DTD in principle, I think, so I guess this is OK.