14565 – Chain of normative statements connecting MIME type to HTML vs. XHTML is broken or unobvious

This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019. Please see the home page for more details.

Status:	RESOLVED FIXED

Alias:	None

Product:	HTML WG
Classification:	Unclassified
Component:	HTML5 spec
Version:	unspecified
Hardware:	All All

Importance:	P2 normal
Target Milestone:	---
Assignee:	Ian 'Hixie' Hickson
QA Contact:	HTML WG Bugzilla archive list

Reported:	2011-10-26 07:11 UTC by Henri Sivonen
Modified:	2011-11-01 06:14 UTC
CC List:	6 users

Description Henri Sivonen 2011-10-26 07:11:16 UTC

I tried to find normative statements in the spec that clearly connect the text/html MIME type to the definition of HTML syntax and HTML document and clearly connect non-text/html MIME type(s) to the definition of XHTML syntax and XHTML document.

I failed to find the relevant definition chains when I ignored informative statements (green notes and sections marked informative). That is, right now, it's hard to find spec-based proof for the claim that HTML vs. XHTML depends on the MIME type alone for serialization and on the HTMLness flag alone for document trees.

Please add the normative statements back if they've been taken out (I think there was a chain of normative statements a couple of years ago) or make them clearer if the statements are there and I just failed to find them.

Comment 1 Ian 'Hixie' Hickson 2011-10-26 21:20:29 UTC

Are you looking for the chain that says when you receive a document of one type you have to use a particular parser, or the chain that says that if you write a document of one type you have to use a particular type, or something else?

Comment 2 Henri Sivonen 2011-10-27 08:43:39 UTC

I'm looking for the following:

 * A normative requirement that says which parser must be used for which input MIME types.

 * A normative terminology definition for an "XHTML document". (The spec talks about writing and parsing them suggesting that an "XHTML document" is at least a sequence of characters or bytes.)

 * Better clarity that the terms "HTML document" and "XML document" are normatively defined by DOM4 / Web DOM Core.

 * Since "HTML document" and "XML document" are defined to be data structures in DOM4 but an "XML document" is defined as textual source in the XML spec, better clarity about terminology (possibly saying that the terms refer to both data structures and their serializations) and an explanation how one can "write" "HTML documents" if they are defined as data structures rather than serializations.

 * A normative statement requiring a serialization constructed by following the "writing HTML documents" section to be labeled as text/html and a requirement for a serialization constructed by following the "writing XHTML documents" section to be labeled as application/xhtml+xml.

I'd be OK with fixing this by saying the following:

User agents must parse text/html resources according to the Parsing HTML Documents section. (Note: The rules in that section construct a Document object that is marked as being an HTML document.) User agents must parse application/xhtml+xml resources and application/xml resources according to the Parsing XHTML Documents section. (Note: The rules in that section construct a Document object that is marked as being an XML document.)
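[Illustration only, not proposed spec text: a minimal Python sketch of the dispatch the paragraph above describes, where the MIME type alone selects the parser. The names Parser, essence and choose_parser are hypothetical, and only the three MIME types named above are handled; everything else is left undecided.]

```python
from enum import Enum
from typing import Optional


class Parser(Enum):
    """Which parsing section applies (hypothetical labels)."""
    HTML = "Parsing HTML Documents"
    XML = "Parsing XHTML Documents"


def essence(content_type: str) -> str:
    """Drop parameters such as charset and normalize case."""
    return content_type.split(";", 1)[0].strip().lower()


def choose_parser(content_type: str) -> Optional[Parser]:
    """Pick a parser from the MIME type alone, never from the content."""
    mime = essence(content_type)
    if mime == "text/html":
        return Parser.HTML
    if mime in ("application/xhtml+xml", "application/xml"):
        return Parser.XML
    # Assumption: other types are out of scope for this sketch.
    return None
```

Note that the function never looks at the bytes of the resource, which is the point of the proposed requirement: a resource labeled text/html goes to the HTML parser even if it happens to be well-formed XML.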

The terms HTML document, XML document and XHTML document can refer to both data structures and sequences of bytes. When referring to data structures, the terms "HTML document" and "XML document" are defined in DOM4. When referring to data structures, "XHTML document" is defined (right here) as an XML document whose root element is in the HTML namespace.

When referring to streams of bytes, an "HTML document" is a stream of bytes labeled as text/html by out-of-band metadata. [Defining XML document left as an exercise to the editor, because the XML spec defines XML documents as streams that satisfy particular syntactic requirements, which leaves the problem of what to call byte streams that are labeled as application/xml but don't satisfy the syntactic requirements.] When referring to streams of bytes, an XHTML document is an XML document which upon parsing would result in a data structure whose root element is in the HTML namespace.
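[Illustration only: a rough Python sketch of the byte-stream definition of "XHTML document" suggested above, that is, parse the stream as XML and check whether the root element is in the HTML namespace. It uses the standard library's xml.etree.ElementTree as a stand-in XML parser, which is an assumption and not what the spec prescribes; is_xhtml_document is a hypothetical name. Streams that are not well-formed simply yield False here, which sidesteps rather than answers the naming problem noted in the brackets above.]

```python
import xml.etree.ElementTree as ET

HTML_NS = "http://www.w3.org/1999/xhtml"


def is_xhtml_document(data: bytes) -> bool:
    """True if data parses as XML and its root element is in the HTML namespace."""
    try:
        root = ET.fromstring(data)
    except ET.ParseError:
        # Not well-formed XML, so not an XML document in the XML spec's sense.
        return False
    # ElementTree spells namespaced tags as "{namespace}localname".
    return root.tag.startswith("{" + HTML_NS + "}")
```

For example, a well-formed stream whose root is an svg element in the SVG namespace parses fine but is not an XHTML document under this definition, and neither is a root html element with no namespace declaration.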

[Requirements for labeling when writing docs in a way that avoids circularities left as an exercise to the editor.]

Comment 3 Ian 'Hixie' Hickson 2011-10-27 20:05:36 UTC

>  * A normative requirement that says which parser must be used for which input
> MIME types.

For text/html, this is now done (by fixing a bogus requirement that was attempting to do this before). For XML, it's done in the context of navigation. I don't think it's our place to do this for XML in other contexts.


>  * A normative terminology definition for an "XHTML document". (The spec talks
> about writing and parsing them suggesting that an "XHTML document" is at least
> a sequence of characters or bytes.)
>
>  * Better clarity that the terms "HTML document" and "XML document" are
> normatively defined by DOM4 / Web DOM Core.
> 
>  * Since "HTML document" and "XML document" are defined to be data structures
> in DOM4 but an "XML document" is defined as textual source in the XML spec,
> better clarity about terminology (possibly saying that the terms refer to both
> data structures and their serializations) and an explanation how one can
> "write" "HTML documents" if they are defined as data structures rather than
> serializations.

I've added a few paragraphs defining these, but it's not very satisfactory. The problem is that we have multiple concepts and we all refer to all of them as documents, and actually going through the specs to use different terms for each one would at this point be quite time-consuming and error-prone. If there are specific occurrences where the ambiguous terms cause real trouble, please let me know.


>  * A normative statement requiring a serialization constructed by following the
> "writing HTML documents" section to be labeled as text/html and a requirement
> for a serialization constructed by following the "writing XHTML documents"
> section to be labeled as application/xhtml+xml.

I'm not sure what such a requirement would mean. For example, what would the implications be on a polyglot document's labeling? Would it mean that you could never label such a document that happened to match the HTML syntax as text/plain? Would it mean that you should not label non-conforming documents as text/html, separate from their not being conforming in the first place?

Instead, the spec specifies (in the text/html definition) that the act of using the text/html label declares that the resource is HTML. Is that sufficient?

Comment 4 contributor 2011-10-27 20:06:29 UTC

Checked in as WHATWG revision r6771.
Check-in comment: A first pass (for this quarter, anyway) at cleaning up some terminology around the word 'document'.
http://html5.org/tools/web-apps-tracker?from=6770&to=6771

Comment 5 Henri Sivonen 2011-10-27 20:22:09 UTC

(In reply to comment #3)
> >  * A normative requirement that says which parser must be used for which input
> > MIME types.
> 
> For text/html, this is now done (by fixing a bogus requirement that was
> attempting to do this before).

OK.

> For XML, it's done in the context of navigation.

Where? I tried following reverse links for XML MIME type but didn't find a navigation-related fetch result dispatched on XML MIME type.

> I've added a few paragraphs defining these, but it's not very satisfactory. 

Thanks.

> Instead, the spec specifies (in the text/html definition) that the act of using
> the text/html label declares that the resource is HTML. Is that sufficient?

Yes, thanks. The problems around polyglot, non-conforming streams, etc., are why I said left as an exercise to the editor. :-)

Comment 6 Ian 'Hixie' Hickson 2011-11-01 06:14:37 UTC

> > For XML, it's done in the context of navigation.
> 
> Where? I tried following reverse links for XML MIME type but didn't find a
> navigation-related fetch result dispatched on XML MIME type.

Search for "Any other type ending in "+xml" that is not an explicitly supported XML type" in the navigation section (step 20, at the moment). That takes you to #read-xml which then defers to the XML spec.

(Marking FIXED since you sound happy with the edits; please reopen if you reply.)