Bug 17710 - Polyglot Markup: Remove XML validity completely
Summary: Polyglot Markup: Remove XML validity completely
Status: CLOSED FIXED
Alias: None
Product: HTML WG
Classification: Unclassified
Component: HTML/XHTML Compatibility Authoring Guide (ed: Eliot Graff)
Version: unspecified
Hardware: PC All
Importance: P2 normal
Target Milestone: ---
Assignee: Eliot Graff
QA Contact: HTML WG Bugzilla archive list
URL: http://dev.w3.org/html5/html-xhtml-au...
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2012-07-07 05:00 UTC by Leif Halvard Silli
Modified: 2013-04-08 21:39 UTC

Description Leif Halvard Silli 2012-07-07 05:00:48 UTC
Section 4. The DOCTYPE requires that, quote:

          ]] * The string html is in lowercase letters.  [[

   And for a number of reasons, this is a good rule to have. However:

PROBLEMS: 
 * Requiring lowercase 'html' is an XML validity constraint:
   http://www.w3.org/TR/xml/#vc-roottype
 * Polyglot Markup, however, currently operates only on
   an XML well-formedness principle; it has no XML validity principle:
   http://dev.w3.org/html5/html-xhtml-author-guide/#introduction

EXAMPLES: 
* Writing <!DOCTYPE HTML> or <!DOCTYPE hTmL> (as opposed to
   <!DOCTYPE html>) is NOT a well-formedness violation; it is thus
   NOT a fatal error in XML.
* The HTML5 validator already accepts uppercase 'HTML' for documents 
   served as application/xhtml+xml: http://goo.gl/hbZvC
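
   The first example can be checked with any non-validating XML parser. The following is only a sketch, assuming Python's standard xml.dom.minidom module (which is not part of this report); the helper name doctype_and_root and the sample document body are made up for illustration.

```python
from xml.dom import minidom

# Hypothetical sample body; any well-formed XHTML document would do.
XHTML_BODY = (
    '<html xmlns="http://www.w3.org/1999/xhtml">'
    "<head><title>t</title></head><body/></html>"
)


def doctype_and_root(doctype_line):
    """Parse with a non-validating parser; return (DOCTYPE name, root element name).

    minidom raises an exception only for well-formedness errors, so getting
    a result back at all means the document is well-formed.
    """
    doc = minidom.parseString(doctype_line + XHTML_BODY)
    return doc.doctype.name, doc.documentElement.tagName


for line in ("<!DOCTYPE html>", "<!DOCTYPE HTML>", "<!DOCTYPE hTmL>"):
    name, root = doctype_and_root(line)
    # A mismatch between the two names is what the root element type
    # validity constraint is about; it does not stop the parse.
    print(line, "-> DOCTYPE name:", name, "| root:", root, "| match:", name == root)
```

   All three variants parse; only the comparison of the two names differs, and a non-validating parser never makes that comparison.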

PROPOSAL:
   Add a new principle in the Introduction section stating that, in addition to the other constraints (XML well-formedness and HTML-compatibility), Polyglot Markup complies with all the XML validity constraints of the DOCTYPE and the DTD of the document. This should be a MUST principle, as such a thing would favour the use of the HTML5 doctype, due to its much simpler XML validity requirements. Though I can also live very nicely with a SHOULD principle.
 
  Benefits of this proposal: 

    (1) It favors the use of the HTML5 DOCTYPE, since the HTML5 doctype does not add any validity constraints - except the constraint that the DOCTYPE itself must contain the 'html' string in lowercase. (Polyglot Markup does not rule out other doctypes than the HTML5 doctype.)
   (2) It makes Polyglot Markup a more universal specification, that applies even to - for example - XHTML 1.0 documents (which are considered 'obsolete but conforming' by HTML5.)
   (3) It makes the spec more logical. After all, we cannot ignore the fact that, merely to have a DOCTYPE, even a simple doctype as the HTML5 doctype, DOES introduce the concept of XML validity into HTML5.
   (4) Someone using an XML toolchain to create polyglot HTML can more easily understand how the concept of XML validity plays into Polyglot Markup. E.g. for a document with the HTML5 doctype, an XML validity check could produce at most a single error (wrong casing of the 'html' string in the DOCTYPE). On the other hand: many of the XML validity concepts that relate to XHTML 1.0 and XHTML 1.1 are relevant for HTML5 too. (For instance, the requirement that @id attributes must be unique.)

ALTERNATIVE PROPOSALS:

   If we do not introduce the XML validity constraint, we need to take one 
   of the following two actions instead:

   ALT 1: Turn the requirement to use lowercase 'html' into an
          informational note about how to cater for validating XML
          processors:
          "To cater for validating XML processors, the string
           html should be in lowercase." 

   ALT 2: Delete the entire requirement that 'html' has to be lowercase.
           Leave it all to XML and HTML5.
Comment 1 David Carlisle 2012-07-10 12:53:00 UTC
(In reply to comment #0)
> Section 4. The DOCTYPE requires that, quote:
> 
>           ]] * The string html is in lowercase letters.  [[
> 

>   Benefits of this proposal: 
> 
>     (1) It favors the use of the HTML5 DOCTYPE, since the HTML5 doctype does
> not add any validity constraints - except the constraint that the DOCTYPE
> itself must contain the 'html' string in lowercase.



The above is not technically correct. If you use 

<!DOCTYPE html>
<html>
...



Then the document is necessarily invalid according to the XML definition of validity. It is not accurate to say that this DOCTYPE adds no constraints; it is more accurate to say that it declares no elements, so any use of any element is invalid.



This is why it is important only to depend on XML well-formedness and not XML validity when defining polyglot documents.
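
The point about undeclared elements can be sketched in code. The following is only an illustration, assuming Python's standard xml.parsers.expat binding; the helper name validity_findings is hypothetical, and it checks just two validity constraints (root element type and element declarations). It is not a validating parser.

```python
import xml.parsers.expat


def validity_findings(text):
    """Hypothetical helper: report two kinds of XML validity problems.

    Raises xml.parsers.expat.ExpatError if the text is not well-formed.
    """
    doctype_name = None
    declared = set()  # element types declared via <!ELEMENT ...>
    used = []         # element types in document order

    def on_doctype(name, system_id, public_id, has_internal_subset):
        nonlocal doctype_name
        doctype_name = name

    def on_element_decl(name, model):
        declared.add(name)

    def on_start(name, attrs):
        used.append(name)

    parser = xml.parsers.expat.ParserCreate()
    parser.StartDoctypeDeclHandler = on_doctype
    parser.ElementDeclHandler = on_element_decl
    parser.StartElementHandler = on_start
    parser.Parse(text, True)

    findings = []
    # Root element type: the DOCTYPE name must match the root element exactly.
    if doctype_name is not None and used and used[0] != doctype_name:
        findings.append(
            f"root element type: DOCTYPE names {doctype_name!r}, root is {used[0]!r}"
        )
    # Every element type that is used must have been declared.
    undeclared = sorted(set(used) - declared)
    if undeclared:
        findings.append("undeclared elements: " + ", ".join(undeclared))
    return findings


print(validity_findings("<!DOCTYPE html><html><body/></html>"))
print(validity_findings("<!DOCTYPE HTML><html><body/></html>"))
```

With the bare HTML5 doctype the undeclared-elements finding appears regardless of the casing of 'html'; the uppercase variant only adds a second finding on top of it.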



You are correct, however, that the requirement that "html" be in lowercase is inconsistent with the stated aims, as it is neither required for the document to be XML well formed nor for it to be conforming HTML5. I suggest this requirement be dropped (your ALT 2).
Comment 2 Leif Halvard Silli 2012-07-10 14:21:47 UTC
(In reply to comment #1)
 
> <!DOCTYPE html>
  [ ... ] 
> It is not accurate to say that this DOCTYPE adds no constraints, it
> is more accurate to say that it declares no elements, so any use of 
> any element is invalid.

He, he ... good point. Yeah, in that case there should be no point in focusing on the casing of the 'html' string inside <!DOCTYPE html> since, if the validating XML parser does not whine about the DOCTYPE, it will - anyhow - whine about undeclared elements ...

> This is why it is important only to depend on XML well-formedness and not XML
> validity when defining polyglot documents.

Yeah, it shows that the DOCTYPE is only for HTML compatibility - and not for XML compatibility … 
 
> You are correct however that the requirement that "html" be in lowercase is in-
> consistent with the stated aims as it is neither required for the document to
> be XML well formed nor for it to be conforming HTML5. I suggest this
> requirement be dropped. (your ALT 2)

Yes, that seems like the right conclusion then.  I changed the title of the bug, accordingly.

Btw, are you 100% sure that there is no trouble for someone with an XML toolchain if the casing of the 'html' string of the DOCTYPE is 'wrong'?
Comment 3 David Carlisle 2012-07-10 14:34:37 UTC
(In reply to comment #2)
> (In reply to comment #1)

> Btw, are you 100% that there is no trouble for someone with an XML toolchain if
> the casing of the 'html' string of the DOCTYPE is 'wrong'?

The resulting file is well formed but not valid. Whether that causes problems depends on what you are doing. But most polyglot documents are likely to be well formed but not valid.
Comment 4 Leif Halvard Silli 2012-07-10 15:12:26 UTC
(In reply to comment #3)

I suggest that the editor just keeps the text as is, but adds, in parentheses - or via some other method - that the casing does not impact the validity or well-formedness:


  ]] The string html is in lowercase letters. (RECOMMENDED, but not REQUIRED.) [[

Justification: I think it is good to say that the casing does not matter - this removes one more "XML-is-so-difficult" misconception. But on the other hand, it is also a complication to send the message that "you can do it in 100 different ways". Additionally - and unlike for text/html - it is a well-formedness constraint for XML that the string DOCTYPE (and, when used, the strings SYSTEM and PUBLIC) is uppercase. Hence, it is important not to create the impression that one can do <!doctype html>.

(In other words: <!DOCTYPE HTML> would be well-formed but not valid, while <!doctype html> would not be well-formed.)
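
The difference between the two cases shows up as a parse error. The following is a minimal sketch, assuming Python's standard xml.etree.ElementTree (a non-validating parser); the helper name is_well_formed and the sample strings are made up for illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if a non-validating parser accepts the text."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


samples = [
    "<!DOCTYPE html><html/>",   # lowercase name
    "<!DOCTYPE HTML><html/>",   # uppercase name: still well-formed
    "<!doctype html><html/>",   # lowercase keyword: not well-formed
    '<!DOCTYPE html SYSTEM "about:legacy-compat"><html/>',
    '<!DOCTYPE html system "about:legacy-compat"><html/>',  # lowercase 'system'
]
for sample in samples:
    print(is_well_formed(sample), sample)
```

The casing of the name after DOCTYPE never affects the result here; the casing of the keywords DOCTYPE and SYSTEM does.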
Comment 5 Eliot Graff 2013-04-08 20:08:21 UTC
Removed the requirement for lowercase "html" and added the following note in Section 4:

Note

The string html SHOULD be in lowercase letters, in order to be both well-formed and valid XML; however, the string MAY be in mixed case or uppercase letters and still be well-formed XML. 


new revision: 1.94; previous revision: 1.93
Comment 6 Eliot Graff 2013-04-08 20:11:01 UTC
Resolving as fixed.

EDITOR'S RESPONSE: This is an Editor's Response to your comment. If
    you are satisfied with this response, please change the state of
    this bug to CLOSED. If you have additional information and would
    like the Editor to reconsider, please reopen this bug. If you would
    like to escalate the issue to the full HTML Working Group, please
    add the TrackerRequest keyword to this bug, and suggest title and
    text for the Tracker Issue; or you may create a Tracker Issue
    yourself, if you are able to do so. For more details, see this
    document:

       http://dev.w3.org/html5/decision-policy/decision-policy.html

    Status: Accepted
    Change Description: See comment 5
Comment 7 Leif Halvard Silli 2013-04-08 21:39:38 UTC
(In reply to comment #5)


Hi Eliot. If there is no DTD, even lowercase fails to be valid. But we have not principally ruled out a DTD, I think, so I guess this is OK.