This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019. Please see the home page for more details.

Bug 1500 - XHTML-sent-as-text/html is parsed as XML
Summary: XHTML-sent-as-text/html is parsed as XML
Status: RESOLVED INVALID
Alias: None
Product: Validator
Classification: Unclassified
Component: Parser
Version: 0.7.0
Hardware: PC Windows XP
Importance: P2 major
Target Milestone: 1.0
Assignee: Olivier Thereaux
QA Contact: qa-dev tracking
URL:
Whiteboard:
Keywords:
Depends on:
Blocks: 14 22
Reported: 2005-06-15 11:32 UTC by Ian 'Hixie' Hickson
Modified: 2011-03-17 12:42 UTC

See Also:


Attachments

Description Ian 'Hixie' Hickson 2005-06-15 11:32:11 UTC
According to the HTML WG, a UA is non-compliant if it handles an XHTML document
sent as text/html as XHTML; such a UA must apparently handle the document as
HTML regardless of what it looks like.

# [...] documents served as text/html should be treated
# as HTML and not as XHTML.
 -- http://lists.w3.org/Archives/Public/www-html/2000Sep/0024.html

The fact that the validator ignores this means that documents that don't comply
with appendix C of XHTML 1.0 are being marked as valid when in fact they aren't
conformant and won't be handled correctly.

This is causing people to file bugs on browsers (I've seen it happen to Opera,
Safari, and Mozilla) which are invalid. The browsers are doing the right thing,
but the documents are wrong. Yet the validator is telling them that the
documents are fine.

I would like to see the validator reject any XHTML-sent-as-text/html as being of
the wrong MIME type.
Comment 1 Olivier Thereaux 2006-08-15 05:30:00 UTC
(In reply to comment #0)
> According to the HTML WG, a UA is non-compliant if it handles an XHTML document
> sent as text/html as XHTML; such a UA must apparently handle the document as
> HTML regardless of what it looks like.

Could you be more precise about what you mean by "treat as HTML" in the context of formal DTD validation, which is what the validator does?

> # [...] documents served as text/html should be treated
> # as HTML and not as XHTML.
>  -- http://lists.w3.org/Archives/Public/www-html/2000Sep/0024.html

Any normative reference would be more appreciated. I don't think it's a good idea to base the validator's behavior on a message in a w3c mailing-list.


> The fact that the validator ignores this means that documents that don't comply
> with appendix C of XHTML 1.0 are being marked as valid when in fact they aren't
> conformant and won't be handled correctly.

Notwithstanding the fact that, last I checked, the appendix C of XHTML 1.0 was an informative set of guidelines, checking documents against these guidelines is the work of a full checker (beyond conformance), and that's what is being developed at this point in time. The "unicorn" [1][2] project has a plug-in checking appendix C rules.


> I would like to see the validator reject any XHTML-sent-as-text/html as being of
> the wrong MIME type.

I do not see a direct link between the rest of your comment and the conclusion that XHTML-sent-as-text/html should simply be rejected. Are you suggesting that it should "be treated as HTML", "checked against appendix C rules", or "rejected"? Please clarify your request.

One additional possibility I see would be to add a warning to the validator output whenever XHTML 1.0 is found served as text/html, and even that is arguable, as I don't think this is compatible with section 5.1 of the XHTML 1.0 specification - http://www.w3.org/TR/xhtml1/#media (I admit being confused by why the normative section 5.1 of the spec seems to refer to the informative appendix C, but that may just be me misunderstanding the specification).

Until we have the unicorn tool ready for prime time, my proposed solution is that whenever the validator finds an XHTML 1.0 doctype document served as text/html, it adds a note to its output encouraging the author to check their documents against the appC checker.

Would that be an acceptable solution?

Also, please feel free to send in test cases, as well as patch proposals, which would help us treat your request quickly.

Thank you.
Comment 2 Ian 'Hixie' Hickson 2006-08-15 05:50:09 UTC
(In reply to comment #1)
> Could you be more precise about what you mean by "treat as HTML" in the
> context of formal DTD validation, which is what the validator does?

I mean handled as described by the SGML spec instead of as described by the XML spec.


> Any normative reference would be more appreciated. I don't think it's a good
> idea to base the validator's behavior on a message in a w3c mailing-list.

The only normative reference is the HTML5 working draft:

   http://www.whatwg.org/specs/web-apps/current-work/#authors-using-html

You'll never see a normative reference from the ex-HTML working group, they never update their errata. The best you'll see from them is the e-mail I posted above.


> > I would like to see the validator reject any XHTML-sent-as-text/html as being of
> > the wrong MIME type.
> 
> I do not see a direct link between the rest of your comment and the conclusion
> that XHTML-sent-as-text/html should simply be rejected. Are you
> suggesting that it should "be treated as HTML", "checked against appendix C
> rules", or "rejected"? Please clarify your request.

Any of those three options would be fine by me.


> Until we have the unicorn tool ready for prime time, my proposed solution is
> that whenever the validator finds an XHTML 1.0 doctype document served as
> text/html, it adds a note to its output encouraging the author to check their
> documents against the appC checker.
> 
> Would that be an acceptable solution?

So long as it doesn't make the author think that what they're doing is ok, I'll be happy.


> Also, please feel free to send in test cases, as well as patch proposals, which
> would help us treat your request quickly.

See bug 14, where Terje cited a testcase that I wrote. The two testcases that I wrote for this were:

   http://damowmow.com/playground/html-not-xml.html
   http://damowmow.com/playground/html-not-xml-2.html

Some other testcases would be http://www.w3.org/TR/xhtml2/ or http://www.w3.org/People/olivier/ for example. Both of those should be flagged as being incorrectly labelled, as they are XHTML but are sent as text/html.
Comment 3 Terje Bless 2006-08-15 06:24:00 UTC
(In reply to comment #1)
> One additional possibility I see would be to add a warning to the validator
> output whenever XHTML 1.0 is found served as text/html, []

Iff an XHTML 1.0 document is served as text/html it may be appropriate to issue an informative note indicating that serving XHTML 1.0 as text/html requires compliance with Appendix C and that checking whether that is the case is beyond the current scope of the Validator.

This, of course, leaves us the risk of a slippery slope with having to issue notices for every other thing we do not in fact check for in the Validator (full conformance springs to mind).
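
As a rough illustration of the informative note discussed in comments 1 and 3, the check could look something like the sketch below. This is not the Validator's actual code: the function names, the note wording, and the idea of keying on the doctype's public identifier are all assumptions made for the example.

```python
from typing import Optional

# Public identifiers of the three XHTML 1.0 DTDs.
XHTML10_PUBLIC_IDS = {
    "-//W3C//DTD XHTML 1.0 Strict//EN",
    "-//W3C//DTD XHTML 1.0 Transitional//EN",
    "-//W3C//DTD XHTML 1.0 Frameset//EN",
}


def media_type(content_type: str) -> str:
    """Return the bare media type of a Content-Type value, without parameters."""
    return content_type.split(";", 1)[0].strip().lower()


def appendix_c_note(content_type: str, public_id: str) -> Optional[str]:
    """Hypothetical helper: return a note for XHTML 1.0 served as text/html."""
    if media_type(content_type) == "text/html" and public_id in XHTML10_PUBLIC_IDS:
        # Assumed wording; the thread does not specify the text of the note.
        return (
            "Note: this XHTML 1.0 document is served as text/html. "
            "Compliance with the Appendix C guidelines is not checked here."
        )
    return None
```

The note is only produced when both conditions hold, so HTML 4.01 documents and XHTML served with an XML media type are unaffected.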
Comment 4 Terje Bless 2006-08-15 06:32:01 UTC
(In reply to comment #2)
> The only normative reference is the HTML5 working draft:

I'll... reserve judgement on the normative weight of WHAT WG specifications.


>    http://www.whatwg.org/specs/web-apps/current-work/#authors-using-html

That seems actively incompatible with existing practice. Are you sure you understand what "<!DOCTYPE html>" actually means in an SGML document instance? Do documents produced according to that specification really validate as HTML 4.01?
Comment 5 Ian 'Hixie' Hickson 2006-08-15 06:36:15 UTC
HTML5 isn't based on SGML.

Anyway, getting back to this bug, the point I was making originally is that the validator shouldn't be using the XML processor at all when the content is sent as text/html. It should exclusively be using the SGML/HTML processor. This is based on the requirement that Steven laid out in the post I cited in the original comment.

If you think this bug is INVALID or should be WONTFIXed, that's fine too; it will merely further underscore the points I made on www-tag. :-)
Comment 6 Terje Bless 2006-08-15 07:42:14 UTC
(In reply to comment #5)
> If you think this bug is INVALID or should be WONTFIXed, that's fine too; it
> will merely further underscore the points I made on www-tag. :-)

I will have to study this further, but I would initially guess this bug would then end up as effectively WONTFIX for (to put it that way) essentially the same reason as you outline on www-tag; marketshare.

The Validator was modified to have this behaviour in the wake of the release of XHTML 1.0 when it became clear that XHTML 1.0 must be parsed as XML to be Valid but in practice was always served as text/html. In order to practically support XHTML 1.0, the Validator -- against my better judgement at the time, and to much wailing and gnashing of teeth -- was modified to reintroduce "sniffing" features I'd previously spent a considerable number of hours removing for the various HTML 3.2/4.01 document types.

While the Validator does not take into account "marketshare" as the typical browser vendor will understand the term, the basis for the decision was evangelism (of Valid markup) and to stay relevant to the web community (rather than ride a hobby horse).

In retrospect we (I) may have made the wrong decision on this. As you note, the problem has only been compounded by letting "marketshare" dictate compliance in the interim and -- without in any way endorsing the approach taken by the WHAT WG specification -- it is probably past time for the issue to be revisited and to let the experience gained in the intervening years serve as a guide forward.
Comment 7 Ian 'Hixie' Hickson 2006-08-16 00:15:34 UTC
Here is a valid XHTML1 file:
   http://damowmow.com/playground/html-or-xml.xml

Here is a valid HTML4 file:
   http://damowmow.com/playground/html-or-xml.html

This bug causes the second one of the two files above to be marked as being a valid XHTML1 file instead of being marked as an HTML4 file. (The two files are identical, they just have different MIME types.)
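
The difference between the behaviour requested here and the doctype "sniffing" described in comment 6 can be sketched as two selection policies. This is only an illustration under stated assumptions: the function names, the "sgml"/"xml" mode labels, and the substring test on the public identifier are hypothetical and are not taken from the Validator's source.

```python
# Media types treated as XML in this sketch (an assumption, not an exhaustive list).
XML_MEDIA_TYPES = {"application/xhtml+xml", "application/xml", "text/xml"}


def media_type(content_type: str) -> str:
    """Return the bare media type of a Content-Type value, without parameters."""
    return content_type.split(";", 1)[0].strip().lower()


def mode_from_media_type(content_type: str) -> str:
    """Policy requested in this bug: the MIME type alone picks the parser."""
    mt = media_type(content_type)
    if mt == "text/html":
        return "sgml"
    if mt in XML_MEDIA_TYPES:
        return "xml"
    raise ValueError(f"unsupported media type: {mt}")


def mode_from_doctype(public_id: str) -> str:
    """Sniffing policy: an XHTML doctype selects XML mode regardless of MIME type."""
    return "xml" if "XHTML" in public_id else "sgml"
```

For a document with an XHTML 1.0 doctype served as text/html, the first policy selects the SGML parser and the second selects the XML parser, which is the disagreement this bug is about.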
Comment 8 Ryan Schmidt 2006-08-21 11:45:52 UTC
Here is a bug I reported to a web site author, but they did not fix it because they believed that their page was fine, and that the browser was broken, because the page passed w3c validation:

http://pear.php.net/bugs/bug.php?id=7183

Here is a test page that I built while investigating the problem:

http://www.ryandesign.com/tmp/selfclose.php

You can switch the MIME type between HTML and XHTML. In HTML mode, when viewed in Safari (2.0.4) or Firefox (1.5.0.6), most of the page is red, which is not what the author intended -- the author intended for the anchor tag at the top to self-close. But in HTML mode the browsers do not support self-closing anchor tags, because that is against section 3 of appendix C of the XHTML spec which says an explicit empty element should instead be created. With the XHTML MIME type, the browser renders the page as the author intended. However, the validator says the document is correct with either MIME type, hence the author's confusion.
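
A check for the Appendix C, section 3 guideline mentioned above might be sketched as follows. This is a deliberately naive, regex-based illustration, not the appC checker referred to elsewhere in this bug: the function name is hypothetical, and the pattern does not account for comments, CDATA sections, or unquoted attribute values containing a slash.

```python
import re

# Elements declared EMPTY in the HTML 4.01 DTDs; only these may use "<x />".
EMPTY_ELEMENTS = frozenset({
    "area", "base", "basefont", "br", "col", "frame", "hr",
    "img", "input", "isindex", "link", "meta", "param",
})

# Matches a start tag written in minimized (self-closing) form, e.g. <a name="x"/>.
SELF_CLOSING = re.compile(r"<([A-Za-z][A-Za-z0-9]*)(?:\s[^<>]*?)?\s*/>")


def minimized_non_empty_elements(markup: str) -> list:
    """Return names of self-closed elements whose content model is not EMPTY."""
    return [
        match.group(1)
        for match in SELF_CLOSING.finditer(markup)
        if match.group(1).lower() not in EMPTY_ELEMENTS
    ]
```

Given markup such as `<a name="top"/><br /><p>text</p>`, the function reports the anchor but not the line break, since `br` is an empty element and may be minimized.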
Comment 9 Olivier Thereaux 2006-08-30 06:46:34 UTC
Thanks Ian, Terje, Ryan for the discussion, making the issue clearer.
 
Let's target the upcoming maintenance release for a resolution of this issue.
Comment 10 Olivier Thereaux 2006-08-30 07:28:07 UTC
(most probably) relevant discussion on Bug #24, in particular:
http://www.w3.org/Bugs/Public/show_bug.cgi?id=24#c8
Comment 11 Olivier Thereaux 2007-03-15 04:24:37 UTC
Seeing as new work on HTML at w3c may eventually provide welcome clarification on this issue (which isn't clear to me yet, with apologies to all who have been providing pointers and thoughts here), I'm moving the target milestone to not slow down the 0.8.0 release.

Keeping as NEW, too.
Comment 12 Olivier Thereaux 2007-04-30 15:49:24 UTC
This bug has been puzzling me for the longest time, but I think I am finally grasping it...

Quoting Shane McCarron (XHTML WG) in 
http://lists.w3.org/Archives/Public/www-validator/2007Apr/0175.html
[[
All XHTML family document types should be processed using the XML parsing
mode of the validator.  There is never a case where the SGML parsing
mode would work, since all the DTDs are XML DTDs, not SGML DTDs.
]]

I never really understood Steven's saying that "documents served as text/html should be treated as HTML", nor found any clear indication that it was relevant to the parse mode used by the validator. This confirmation by Shane that the XHTML family of documents is based on XML DTDs and therefore should obviously use the XML mode is the kind of disambiguation that I was looking for.

When Ian says: [[
I would like to see the validator reject any XHTML-sent-as-text/html as being of
the wrong MIME type.
]] 
I have to disagree based on http://www.w3.org/TR/xhtml1/#media

Also: [[
The fact that the validator ignores this means that documents that don't comply
with appendix C of XHTML 1.0 are being marked as valid when in fact they aren't
conformant and won't be handled correctly.
]]
The XHTML spec's prose is not as strong as you imply here. Appendix C is informative, and not referred to in the conformance section - http://www.w3.org/TR/xhtml1/#docconf

Granted, http://www.w3.org/TR/xhtml1/#media is a little confusing because it refers to an informative section (appendix C) from a normative section (section 5.1 on media types is normative), while appendix C is not referred to in the conformance section. Maybe this will all be clarified in a future errata version of XHTML 1.0. In the meantime, I believe the practical course to follow is:
* to close this bug as "not a bug". There is nothing wrong with parsing XHTML in XML mode
* to keep making progress on integrating the Appendix C checker to the validator - see Bug 4514 - and figure out whether problems raised by the appC checker should be errors or warnings.

Comment 13 Ian 'Hixie' Hickson 2007-04-30 20:26:41 UTC
That's making the validator a purely theoretical tool and ignoring the pragmatic needs of authors. I guess that's your choice.
Comment 14 Shane McCarron 2007-04-30 20:32:03 UTC
In what way?  XHTML documents are always XML, in particular for purposes of validation. You cannot validate an XHTML document in "SGML" mode.  It makes no sense.  Yes, the XHTML 1.0 Recommendation permits serving such documents with a media type of text/html.  Yes, some existing user agents need XHTML to be served that way so that they will swallow it. That doesn't say anything about their parsing mode with regard to the W3C validator.  

If you like, I can get a formal resolution of the XHTML Working Group to this effect, but I guarantee you this is the correct interpretation.
Comment 15 Ian 'Hixie' Hickson 2007-04-30 20:54:48 UTC
It's 2007. If you still don't understand how this is wrong, I'm not really interested in arguing it. I fully expect the HTML working group to have very clear normative criteria explaining this in due course anyway.
Comment 16 Ryan Schmidt 2007-05-02 10:36:34 UTC
Olivier, I don't see why you've now changed your mind on this. My comment #8 from last year still applies, and all I can do is paraphrase it again. Heck, the original problem description from almost 2 years ago seems to still describe it pretty well too: The validator declares documents as valid though they violate the XHTML 1.0 spec, appendix C, section 3.

http://www.w3.org/TR/xhtml1/#guidelines

I'm not implying any strength of the prose of the spec, and I don't care whether that section is "normative" or "informative" or any other big word; it describes a behavior that web site authors and therefore browsers should follow, and popular browsers (Safari, Firefox) do in fact implement the described behavior. When web site authors do not follow the guidelines of that section of the spec, their web sites break under those browsers. When bug reports are filed on this with the web site authors, and the particular browser they're using (Internet Explorer, I think) does not exhibit this problem, and the web site author does one last check and validates the document using the w3 validator, they find it to be valid and in turn reject the bug report, saying the browser must be at fault. But it is not. It is behaving in accordance with the aforementioned spec.

I did not spend hours researching this problem just so you could decide 9 months later that it's not a problem after all. It is. It really is. Please reopen the bug and amend the validator.

It takes an effort to report bugs to web site authors (or anyone else). When those bug reports are rejected, some reporters may give up rather than fight them on it. The end result is that fewer bugs in web sites are fixed, and users of standards-compliant browsers are disadvantaged. That is surely not the end result we're all looking for.
Comment 17 Olivier Thereaux 2007-05-02 10:53:03 UTC
Ryan,

(In reply to comment #16)
> Olivier, I don't see why you've now changed your mind on this.

I have not "changed my mind". I have been very puzzled by this issue for the longest time, openly so. I have asked people for discussions and clarification. I eventually got a reasonable clarification from the people who made the xhtml standard, which the validator is here to implement and enforce. So I'm moving on.

> My comment #8
> from last year still applies, and all I can do is paraphrase it again. 

Your comment #8 does not apply to the decision to parse (or not) XHTML with the XML parser. Your comment applies to the usefulness of having extra checks for appendix C in the case of XHTML sent as text/html. That's Bug 4514, and that bug is open.