Re: Updated DOCTYPE versioning change proposal (ISSUE-4)

On Jan 2, 2010, at 4:21 PM, Larry Masinter wrote:

> The proposal was updated significantly, based on comments. I’ve  
> tried to address the “compound” issue as well.
>
> Here’s a version with all the parts of a change proposal combined
> into a single document.  Since the discussion has been long, the
> rationale is long.

Thanks for the revisions. I've updated the issue status list to point  
to the new version.

  - Maciej


>
> Summary:
>  Describe the DOCTYPE element and provide for allowing DOCTYPE  
> definitions.
> Rationale:
>
> The DOCTYPE has been part of HTML since its earliest versions, and  
> is still required.  This change proposal makes its history and use  
> clearer, without introducing any HTML interpreter changes.
> The HTML5 specification is intended to replace previous versions of
> the HTML specification and the definition of the text/html MIME type.
> Redefining a MIME type should not make previously conforming
> documents non-conforming; even if features are
> “deprecated” (conforming but not recommended), the conforming but
> not-recommended constructs should be described completely.
> This feature is “in scope”. There was an argument that features only  
> intended for use in “controlled environments” were not in scope for  
> the HTML working group.  (This is discussed in
> http://lists.w3.org/Archives/Public/public-html/2010Jan/0013.html )
> In particular, the working group intends to support “polyglot”  
> documents which are both valid XML and XHTML and also valid as HTML  
> text/html; since XML workflows often require a !DOCTYPE with a  
> PublicIdentifier and a SystemIdentifier, this increases the  
> footprint of “polyglot” documents.
> Other ideas for including a new versioning mechanism have been  
> floated, e.g., an attribute on the <html> element. However, those  
> alternatives have disadvantages – they would introduce the  
> possibility of inconsistencies, where the DOCTYPE contains one  
> version string and the version attribute contains another, and have  
> little or no benefit. In particular, there were claimed advantages
> of a version attribute on the html element rather than using DOCTYPE:
> It was claimed that such a version indicator would be “easier to type
> correctly from memory”:
>  If an HTML author is relying on memory, the author should leave out
> the HTML version string and use the <!DOCTYPE html> form, since it
> is clearly not a “controlled environment”.
> In any case, a simpler version indicator is not useful because in  
> fact HTML evolves more continuously and a version indicator that was  
> easy to remember would not actually address the use cases where a  
> version indicator is actually useful.
>  It was claimed that such a version indicator would be “easier to  
> read”:
> Even if it were true,  “ease of reading HTML markup directly” is not  
> a strong design goal for HTML, compared to other uses.
> The proposal below recommends omitting a version indicator except in  
> limited situations, and recommends readers ignore the version  
> indicator except for specific purposes, so that “ease of reading”  
> only matters in limited situations anyway.
> Whether something is “easy to read” is not an independent factor,  
> but dependent on context and familiarity. Since the DOCTYPE element  
> is there anyway, and web authors are familiar with it, and it is  
> documented in every book, online tutorial and other HTML reference,  
> using “DOCTYPE” for a version indicator will result in documents  
> that are “easier to read” because of familiarity.
> There was an argument that the change proposal was somehow related  
> to “vastly increased reverse-engineering costs”. This argument does  
> not apply to this change proposal; see
> http://lists.w3.org/Archives/Public/public-html/2010Jan/0011.html .
> The current HTML5 spec says the DOCTYPE is “mostly useless”.  This  
> wording should change:
> It was claimed that this means the same thing as “of limited  
> utility”. In fact, an informal survey showed that  “mostly useless”  
> and “of limited utility” meant different things to a number of people:
>  “mostly useless” was much “stronger”
> “mostly useless” meant that in almost all situations, the utility  
> was zero, while “of limited utility” meant that the utility was less  
> than expected but not uniformly different.
> Even if “mostly useless” and “of limited utility” could mean the  
> same thing in some contexts, “mostly useless” was called “childish”  
> or “petulant” and “inappropriate in a formal standards document”.
> Many of the arguments made in previous discussions about versions  
> and doctypes were not careful to distinguish between “version of  
> specification” and “version of implementation”. It should be noted  
> that many *want* a version indicator to note “version of  
> implementation”, i.e., as an indicator of “best viewed by Firefox
> 4.0 or later” or some such.  However, this change proposal is very  
> clearly providing for a version of a “specification”, and, in  
> particular, of the HTML specification, with the possibility of “mix”  
> specifications added.
> Many of the arguments in previous discussions were arguing against  
> version-specific browser behavior. But this change proposal  
> specifically does NOT allow for (any additional) version-specific  
> behavior, and in fact explicitly disallows it.
> There was one suggestion that, instead of PublicIdentifier and
> SystemIdentifier, ONLY the SystemIdentifier be allowed, but
> that the RFC 3151 URN version of the PublicIdentifier might be
> supplied, e.g.,
> <!DOCTYPE SYSTEM “urn:publicid:-:W3C+HTMLWG+hixie:nonsgml+html+20100401:en”>
> rather than
> <!DOCTYPE PUBLIC “-//W3C HTMLWG hixie//NONSGML HTML 20100401//EN” about:legacy-compat>
> This suggestion is interesting but doesn’t seem to improve anything
> (since the URN isn’t easily resolvable) when considering
> compatibility with existing deployed XML editing workflows.
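For readers unfamiliar with the RFC 3151 form mentioned above, the transcription from a Formal Public Identifier to a urn:publicid: URN can be sketched roughly as follows. This is only an illustration: the function name fpi_to_urn is hypothetical, and the mapping table reflects the transcription rules as commonly summarized from RFC 3151, so it should be checked against the RFC before being relied on. Note that this sketch preserves case, whereas the example quoted above is written in lowercase.

```python
# Characters that RFC 3151 transcribes one-for-one (assumed table; verify against the RFC).
_SINGLE = {
    " ": "+", "+": "%2B", ":": "%3A", "/": "%2F", ";": "%3B",
    "'": "%27", "?": "%3F", "#": "%23", "%": "%25",
}


def fpi_to_urn(fpi: str) -> str:
    """Transcribe a Formal Public Identifier into a urn:publicid: URN."""
    # Normalize whitespace: trim the ends and collapse runs to a single space.
    text = " ".join(fpi.split())
    out = []
    i = 0
    while i < len(text):
        pair = text[i:i + 2]
        if pair == "//":      # FPI field separator becomes ":"
            out.append(":")
            i += 2
        elif pair == "::":    # double colon becomes ";"
            out.append(";")
            i += 2
        else:                 # everything else: single-character table, or unchanged
            out.append(_SINGLE.get(text[i], text[i]))
            i += 1
    return "urn:publicid:" + "".join(out)


print(fpi_to_urn("-//W3C HTMLWG hixie//NONSGML HTML 20100401//EN"))
```

The two-character sequences are checked before the single-character table so that "//" is not transcribed as two escaped slashes.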
> While everyone *hopes* there are never going to be any further  
> incompatible changes to HTML in the future, there *is* a possibility  
> that in some unfortunate situation, it will be necessary to  
> introduce incompatible changes. In that case, it will be necessary  
> to introduce a new version indicator, to allow (alas) processors to  
> determine which of the incompatible interpretations was meant. While  
> this will be unfortunate, it would be doubly unfortunate to have to  
> introduce a new “place” for a version indicator that was previously  
> non-conforming, which would cause even worse uproar, because  
> documents that *didn’t* want the new incompatible behavior would
> have no place to say explicitly which version of the
> incompatible behavior they wanted. By *allowing* a version indicator
> in conforming content today, we can avert more serious damage.
> Having a location for a version indicator, even if it isn’t  
> explicitly used, allows it to be used at some point in the future.  
> In the history of computer languages, there are no languages that  
> have not evolved, been extended, or otherwise "versioned" as long as  
> the language has been in use.  This applies to network protocols,  
> character encoding standards, programming languages, and certainly  
> to every known technology found on the web. There are no known cases  
> where a language hasn't gone through some at least minor  
> incompatible change. The standards process is established as a way  
> of evolving specifications and implementations in a way to reduce  
> the likelihood of complete failure to interoperate, but certainly  
> not to guarantee that no incompatible changes will be needed in the  
> future.
> There was a suggestion that the final “EN” in the PublicIdentifier  
> might be omitted, but that didn’t seem to be allowed in the FPI  
> syntax after all, and if we’re going to be FPI compatible, might as  
> well pick up the whole thing. That’s why “NONSGML” was added too.
>
>
> See also the background document “Architectural Considerations for
> Language Versioning on the Web”:
> http://www.w3.org/2001/tag/doc/versioning-html/versioning-html-20090611.html
>
>  For additional rationale and discussion, see the HTML WG tracker
> ISSUE-4:  http://www.w3.org/html/wg/tracker/issues/4
>
> Impact:
>
> This proposal does not add any new headers or elements to HTML.  It  
> more clearly shows the evolution and reasons for no longer relying  
> on DOCTYPE to affect browser behavior.
>
> This proposal does not require any changes to any browser or HTML  
> interpreter; existing behavior is maintained.
>
> It allows but does not require some validators to perform additional  
> validation, in that there may be additional validation based on the  
> PublicIdentifier or SystemIdentifier.   As behavior does not depend  
> on the DOCTYPE, validating the DOCTYPE is not required.
>
> This proposal allows some HTML documents that were previously  
> conforming to remain conforming.  It also allows the continued use  
> of PublicIdentifier and/or SystemIdentifier DOCTYPEs to be valid in  
> new documents.
> Specific proposal:
>
> replace section 9.1.1 of the HTML5 specification with:
>
> 9.1.1 The DOCTYPE
>
> The DOCTYPE header element is a required element. Originally, when  
> HTML was defined as an application of SGML (see [ISO8879]), a valid  
> HTML document declared what version of HTML was used in the  
> document, with a document type declaration which named the document  
> type definition (DTD) in use for the document.  In practice, web  
> authors have not been careful to consistently label versions, and  
> many, if not most, HTML documents on the web do not conform to the  
> DTD that they specify.
>
> It is common for implementations to trigger wildly different  
> behavior (“quirks” modes) due to the presence of specific DOCTYPE  
> declarations, or the absence  of a declaration altogether; see  
> section 9.2.5.4 for details of this behavior.
>
> For these reasons, the DOCTYPE header is REQUIRED for HTML content  
> served as text/html (and optional for content served as an XML media  
> type), but supplying an explicit version indicator is NOT  
> RECOMMENDED except in limited circumstances.
>
> The syntax of the DOCTYPE element is:
>
> <!DOCTYPE html>
> <!DOCTYPE html PUBLIC “PublicIdentifier” “SystemIdentifier”>
> <!DOCTYPE html SYSTEM “about:legacy-compat”>
>
> <!DOCTYPE html> is the simplest, recommended form of the DOCTYPE  
> declaration.
> The use of public identifiers (required in HTML 4.01) is discouraged  
> in this specification; some public identifiers may trigger different  
> behavior in deployed browsers (Section [#quirks-mode] in this  
> document and [hsvonin]).
> The SystemIdentifier is syntactically a URI (not a “URL” or “IRI”).  
> The SystemIdentifier was intended to be a locator for downloading a  
> DTD and entity sets in generic SGML and XML processors, and some XML  
> workflows designed to produce HTML require either a well-known
> PublicIdentifier, or else a SystemIdentifier that can actually
> be fetched.
> The special URI “about:legacy-compat” is reserved for use as a  
> SystemIdentifier in a declaration of the form:
>                   <!DOCTYPE html SYSTEM “about:legacy-compat”>.
> Except for explicitly defined behavior (used to trigger “quirks  
> mode”, see section [#parse-behavior], [#quirks-mode] and [hsvonin]),  
> implementations which consume HTML MUST NOT use the DOCTYPE element  
> to trigger different processing behavior.
> Implementations which validate HTML content SHOULD use the latest  
> version of this specification to validate against; validating only  
> against older specifications, or only against the indicated version,  
> is likely to be much less useful.  See Section [#validation].
> HTML  documents not served as an XML media type MUST include a  
> DOCTYPE header, since many browsers, in the absence of a DOCTYPE  
> header, will trigger a “quirks” mode of rendering.
> Documents served as an XML media type MAY include a DOCTYPE header,  
> either to allow compatible content (so-called “polyglot” documents  
> which are both valid HTML and also valid XHTML) or to support  
> version-specific XML processing. While the DOCTYPE header is not
> required, including it may help in XHTML/HTML crossover.
>
> “html”, “PUBLIC” and “SYSTEM” are case insensitive and may have
> additional spaces around them. The “PublicIdentifier” and
> “SystemIdentifier” may use either double or single (apostrophe)
> quote marks.
>
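As a rough illustration of the three DOCTYPE forms and the case and quoting rules described above, a recognizer might look like the following Python sketch. The names Doctype and parse_doctype are hypothetical. The sketch assumes that the DOCTYPE keyword itself is matched case-sensitively and that at least one whitespace character separates the tokens; the proposal states neither. It also uses plain ASCII quote marks rather than the typographic quotes that appear in this message.

```python
import re
from typing import NamedTuple, Optional

LEGACY_COMPAT = "about:legacy-compat"


class Doctype(NamedTuple):
    public_id: Optional[str]
    system_id: Optional[str]


def _quoted(name: str) -> str:
    # A value in double OR single quotes; each alternative has its own named group.
    return '(?:"(?P<%s_dq>[^"]*)"|\'(?P<%s_sq>[^\']*)\')' % (name, name)


_DOCTYPE_RE = re.compile(
    r"<!DOCTYPE\s+(?i:html)"                       # "html" is case insensitive
    + r"(?:\s+(?i:PUBLIC)\s+" + _quoted("public")  # PUBLIC form: two quoted values
    + r"\s+" + _quoted("system")
    + r"|\s+(?i:SYSTEM)\s+" + _quoted("legacy")    # SYSTEM form: one quoted value
    + r")?\s*>"
)


def _pick(match, name: str) -> Optional[str]:
    # Return whichever quoting alternative matched; "" is a valid value.
    value = match.group(name + "_dq")
    return value if value is not None else match.group(name + "_sq")


def parse_doctype(text: str) -> Optional[Doctype]:
    """Return the identifiers if text is one of the three forms, else None."""
    match = _DOCTYPE_RE.fullmatch(text.strip())
    if match is None:
        return None
    public_id = _pick(match, "public")
    if public_id is not None:
        return Doctype(public_id, _pick(match, "system"))
    legacy = _pick(match, "legacy")
    if legacy is None:
        return Doctype(None, None)                 # plain <!DOCTYPE html>
    # The SYSTEM-only form is reserved for about:legacy-compat.
    return Doctype(None, legacy) if legacy == LEGACY_COMPAT else None


print(parse_doctype("<!DOCTYPE HTML  public '-//W3C//NONSGML HTML 5.0//EN' \"\">"))
```

A SYSTEM-only declaration with any identifier other than about:legacy-compat is rejected here, following the syntax list in 9.1.1.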
> Note that XML allows additional forms of DOCTYPE declarations which
> are not permitted here; however, this proposal is compatible with
> most widely deployed XML software.
>
> In most instances, the simple <!DOCTYPE html> form is all that is
> required or recommended. The form with
> “SYSTEM about:legacy-compat” is provided to allow for XSLT processors.
>
> 9.1.1.1 Public Identifier
>
> A  PublicIdentifier SHOULD NOT be used unless the content is being  
> managed in a controlled environment where the intended version is  
> known, and the document is well-formed; this might be the case in  
> some XML-based workflows and editing environments, or content  
> management systems and other production workflows.
>
> Even though HTML is no longer being defined as an SGML application,  
> previous versions of HTML were, and so the format of  
> PublicIdentifier was defined to be consistent with Formal Public  
> Identifiers of SGML (http://xml.coverpages.org/tauber-fpi.html).
>
>  Until this specification is approved as a W3C recommendation, the
> PublicIdentifier MAY identify the specification referenced and
> its date.  The pattern for the PublicIdentifier is simple. The
> primary template is only the date in yyyymmdd terms:
>
> “-//WHATWG//NONSGML HTML 20100401//EN”
> for the 2010 April 1 version of the WhatWG edition of the
> specification.
> “-//W3C HTMLWG//NONSGML HTML 20100401//EN”
> for the HTML working group editor’s draft of the same date.
>
>  If multiple alternative specifications are available in a  
> committee, the draft’s or author’s nickname or handle may be used to  
> distinguish which specification is being referenced, e.g.,
>
> “-//W3C HTMLWG hixie//NONSGML HTML 20100401//EN”
> “-//W3C HTMLWG manu//NONSGML HTML 20100401//EN”
>
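The PublicIdentifier pattern described above (owner, optional nickname, and a yyyymmdd date) could be generated along these lines. This is a sketch only: the function name and its parameters are hypothetical, and the fixed "NONSGML HTML" and "EN" parts are simply copied from the examples in the proposal.

```python
import re
from datetime import datetime
from typing import Optional


def build_public_identifier(owner: str, date: str,
                            nickname: Optional[str] = None) -> str:
    """Build a draft-stage PublicIdentifier such as
    -//W3C HTMLWG hixie//NONSGML HTML 20100401//EN
    """
    # The template calls for the date in yyyymmdd form.
    if not re.fullmatch(r"\d{8}", date):
        raise ValueError("date must be eight digits (yyyymmdd)")
    datetime.strptime(date, "%Y%m%d")  # raises ValueError for impossible dates
    # A nickname, if given, distinguishes alternative drafts from one group.
    owner_part = owner if nickname is None else f"{owner} {nickname}"
    return f"-//{owner_part}//NONSGML HTML {date}//EN"


print(build_public_identifier("WHATWG", "20100401"))
print(build_public_identifier("W3C HTMLWG", "20100401", nickname="hixie"))
```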
> When this specification becomes a W3C Recommendation, and only then,  
> the  PublicIdentifier:
>     “-//W3C//NONSGML HTML 5.0//EN”
> may be used.
>
> However, HTML documents MUST NOT use “-//W3C//NONSGML HTML 5.0//EN”  
> until the edition of this specification referenced is actually  
> approved and published as a W3C Recommendation.
>
> Note that non-standard behavior may ensue from using any of many  
> well-known Public Identifiers; these were chosen not to trigger any  
> such behavior.
>
> 9.1.1.2 PublicIdentifier for compound specifications
>
> Note that a PublicIdentifier only identifies a single specification,  
> not a complete implementation, a suite of specifications, or a  
> combination of vocabularies from multiple specifications.
> Constructing a PublicIdentifier for such a combination requires
> publication of an actual specification which describes that
> combination.
>
> Groups wishing to support the combination of HTML and other  
> specifications may supply short specifications showing how  
> additional vocabularies may be used with HTML; for example, a short  
> document “how to use RDFa with HTML” might be published. (This
> document would reference RDFa and HTML but not include either
> specification.) In such a case, the “+” format might be used:
>
> “-//W3C RDFAWG//NONSGML HTML+RDFa 20100401//EN” might reference the  
> HTML+RDFA document published by the RDFA working group.
>
> The W3C Hypertext coordination group is encouraged to coordinate  
> assignment of public identifiers.
>
> 9.1.1.3  SystemIdentifier
>
> The SystemIdentifier is a URL, either relative or absolute.
> If no PublicIdentifier is supplied, the effect is to not have a
> version at all. In this case, the SystemIdentifier
> “about:legacy-compat” should be used:
> <!DOCTYPE html SYSTEM “about:legacy-compat”>
>
> If a PublicIdentifier is supplied, the SystemIdentifier may be:
>
> An actual address (URL)  of a DTD and other XML material, as per the  
> XML specification, which can be fetched and used by an XML  
> processor. Note that W3C does not intend to supply or publish any  
> such URLs or DTDs. Note that no current URL used in HTML would  
> occur. This usage should only be used if the URL is actually  
> resolvable.
> The empty string, “”. This system identifier can be used in
> situations where there is no fetchable material related to the XML
> forms, but a specific version indicator is wanted and supplied
> by the PublicIdentifier.
>
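The SystemIdentifier rules in 9.1.1.3 can be summarized as a small serializer. The sketch below is one possible reading, not part of the proposal: render_doctype is a hypothetical name, the output follows the "<!DOCTYPE html ...>" forms from the syntax list in 9.1.1, and a missing SystemIdentifier alongside a PublicIdentifier is assumed to mean the empty string.

```python
from typing import Optional

LEGACY_COMPAT = "about:legacy-compat"


def render_doctype(public_id: Optional[str] = None,
                   system_id: Optional[str] = None) -> str:
    """Serialize a DOCTYPE following the rules sketched in 9.1.1.3."""
    for value in (public_id, system_id):
        if value is not None and '"' in value:
            raise ValueError("identifiers containing a double quote are not handled here")
    if public_id is None:
        if system_id is None:
            return "<!DOCTYPE html>"               # simplest, recommended form
        if system_id != LEGACY_COMPAT:
            # Without a PublicIdentifier, only about:legacy-compat is provided for.
            raise ValueError("SYSTEM-only form requires about:legacy-compat")
        return f'<!DOCTYPE html SYSTEM "{LEGACY_COMPAT}">'
    # With a PublicIdentifier: a fetchable URL, or the empty string (assumed default).
    resolved = "" if system_id is None else system_id
    return f'<!DOCTYPE html PUBLIC "{public_id}" "{resolved}">'


print(render_doctype())
print(render_doctype(system_id=LEGACY_COMPAT))
print(render_doctype("-//W3C HTMLWG//NONSGML HTML 20100401//EN"))
```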
>
>

Received on Monday, 4 January 2010 02:54:04 UTC