This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019.

Bug 6296 - text/html with XHTML 1.0 doctype and HTML5 override validates as XHTML5
Summary: text/html with XHTML 1.0 doctype and HTML5 override validates as XHTML5
Status: CLOSED FIXED
Alias: None
Product: HTML Checker
Classification: Unclassified
Component: General
Version: unspecified
Hardware: PC Windows XP
Importance: P2 normal
Target Milestone: ---
Assignee: Olivier Thereaux
QA Contact: qa-dev tracking
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2008-12-10 14:44 UTC by Simon Pieters
Modified: 2015-08-23 07:07 UTC
CC List: 3 users

See Also:



Description Simon Pieters 2008-12-10 14:44:43 UTC
http://validator.w3.org/check?uri=http%3A%2F%2Fcsszengarden.com%2F&charset=%28detect+automatically%29&doctype=HTML5&group=0

csszengarden.com is served as text/html, so the validator should use the HTML5 parser -- not the XML parser.
Comment 1 Olivier Thereaux 2008-12-10 15:02:49 UTC
Ack. Thanks for the report Simon.

Using debug mode shows the validator choosing HTML5+XML based on the "doctype" decision factor, which is a little odd.

http://validator.w3.org/check?uri=http%3A%2F%2Fcsszengarden.com%2F&doctype=HTML5&debug=1

Nevertheless, this is likely to be a bit of a pain, since the criteria to determine "is it XML" are different between html5 (media type) and the rest of the HTML family (media type, doctype, xml declaration...)
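
As a rough illustration of the difference described above, here is a minimal Python sketch. The function names, the set of XML media types, and the way the legacy factors are combined (any single factor is enough) are assumptions made for illustration; this is not the validator's actual logic.

```python
from typing import Optional

# Assumption: media types treated as XML for the purpose of this sketch.
XML_MEDIA_TYPES = {"application/xhtml+xml", "application/xml", "text/xml"}


def is_xml_html5(media_type: str) -> bool:
    """HTML5 rule: the media type alone decides between HTML and XML."""
    essence = media_type.split(";")[0].strip().lower()
    return essence in XML_MEDIA_TYPES


def is_xml_legacy(media_type: str,
                  doctype: Optional[str],
                  has_xml_declaration: bool) -> bool:
    """Pre-HTML5 rule (hypothetical combination): media type, XML
    declaration, or an XHTML doctype can each mark the document as XML."""
    if is_xml_html5(media_type):
        return True
    if has_xml_declaration:
        return True
    return doctype is not None and "XHTML" in doctype.upper()
```

Under these assumptions, a text/html page with an XHTML 1.0 doctype is XML by the legacy rule but not by the HTML5 rule, which is the mismatch reported in this bug.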
Comment 2 Simon Pieters 2008-12-10 15:13:16 UTC
I thought the rest of the HTML family also decided exclusively on media type -- at least that's what Opera, Firefox and WebKit have implemented.

It would be nice if the validator said which parser it used and which MIME type it got, and further it would be nice to have a parser override. (But maybe these things should be in separate reports.)
Comment 3 Olivier Thereaux 2008-12-10 16:00:38 UTC
(In reply to comment #2)
> I thought the rest of the HTML family also decided exclusively on media type --
> at least that's what Opera, Firefox and WebKit have implemented.

Basically, browsers and a validator are different classes of products. While it may be true that browsers can simply decide to "parse as html" when receiving text/html, for a validator there is no such thing as "parse as html" (or at least there wasn't before html5).

Why? Because browsers don't use DTDs or any kind of schema. 
On the other hand, that's precisely what validators do.

For anything before html5, a validator had a choice between SGML (for HTML4.01 and below) and XML (for XHTML 1.0 and up). The XHTML DTDs are XML DTDs, and XHTML documents MUST be parsed with an XML DTD validator. 

Try validating XHTML with an SGML validator and it will:
1) probably crash (or at least puke), because XML DTDs are different from SGML DTDs
2) complain about all the XML-ish constructs such as <br />
   (because, in SGML, such constructs have a completely different meaning than in XML)
3) completely ignore issues such as missing closing tags
   (because, in SGML/HTML, omitting closing tags is OK, whereas it is NOT in XML)
etc.
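
Point 3 can be seen with Python's standard library. This is only a sketch: `html.parser` is a lenient HTML tokenizer, not an SGML validator, and the helper names are hypothetical. It shows that an XML parser rejects omitted closing tags while an HTML-style parser reads the same markup without complaint.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# A fragment that omits the </li> closing tags.
FRAGMENT = "<ul><li>one<li>two</ul>"


def xml_accepts(markup: str) -> bool:
    """Return True if the markup is well-formed XML."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


class TagCollector(HTMLParser):
    """Record every start tag the HTML tokenizer reports."""

    def __init__(self) -> None:
        super().__init__()
        self.start_tags = []

    def handle_starttag(self, tag, attrs):
        self.start_tags.append(tag)


def html_start_tags(markup: str) -> list:
    """Feed markup to the lenient HTML tokenizer and list its start tags."""
    collector = TagCollector()
    collector.feed(markup)
    collector.close()
    return collector.start_tags
```

Here `xml_accepts(FRAGMENT)` is false, while `html_start_tags(FRAGMENT)` still reports the `ul` and both `li` elements.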

I somehow wish the HTML working-group of old had been clearer about this: there has been much confusion and frustration, especially since Steven's infamous message on the topic:
http://lists.w3.org/Archives/Public/www-html/2000Sep/0024.html

Anyway, here's hoping that HTML5 can be/remain clearer on that matter.

> It would be nice if the validator said which parser it used and which MIME type
> it got, and further it would be nice to have a parser override. (But maybe
> these things should be in separate reports.)

The &debug=1 parameter does just that.
It's not shown by default, to keep the UI not-too-complicated.

Comment 4 Simon Pieters 2008-12-10 16:18:27 UTC
The validator *could* say that "well, this isn't XML, so I'm going to validate as HTML 4.01 instead" (or refuse to validate). But I digress.


&debug=1 is good to know... However, my request still stands. There's no parser override. I believe it's quite possible to provide parser and MIME information while keeping the UI not-too-complicated.
Comment 5 Simon Pieters 2008-12-10 16:35:31 UTC
Filed bug 6298 and bug 6299.
Comment 6 Olivier Thereaux 2008-12-10 16:39:03 UTC
(In reply to comment #4)
> The validator *could* say that "well, this isn't XML, so I'm going to validate
> as HTML 4.01 instead" (or refuse to validate). But I digress.

I think it would be counterproductive to do that.

If I am declaring XHTML 1.0, authoring valid XHTML 1.0 but serving it as text/html because I can't change my server config (the case for most people) or because I want the pages to show in IE6, I wouldn't want the validator to tell me "you suck! this is not valid HTML4.01". 

That is a sure way to alienate people, IMHO.

> &debug=1 is good to know... However, my request still stands. There's no parser
> override. I believe it's quite possible to provide parser and MIME information
> while keeping the UI not-too-complicated.

Why not. I was about to ask for separate bugzilla items, but see you did that already. Thanks!
Patches and/or UI suggestions welcome, too.
Comment 7 Simon Pieters 2009-01-02 12:39:11 UTC
A slightly different scenario but probably the same bug: text/html HTML5 with xmlns declaration is misvalidated as XHTML5.

http://validator.w3.org/check?uri=http%3A%2F%2Fwww.aneventapart.com%2F&debug=1
Comment 8 Dean Edridge 2009-01-03 13:51:56 UTC
> 
> Nevertheless, this is likely to be a bit of a pain, since the criteria to
> determine "is it XML" are different between html5 (media type) and the rest of
> the HTML family (media type, doctype, xml declaration...)
> 

I don't see how that is relevant. Any web page using the new HTML doctype (aka the HTML5 doctype "<!DOCTYPE html>") should be passed over to the Validator.nu side of the W3C's validator. It's Validator.nu that should be "deciding" if the document is HTML5 or XHTML5, *not* the W3C's validator. Hundreds of hours have been put into programming validator.nu, and it works perfectly on http://validator.nu. There's no need for the W3C's validator to mimic all those algorithms. Is that what's being suggested, or am I missing something here?

Here is all the W3C's validator needs to do. 

if document is normal HTML4/XHTML1 etc
    
    do the normal validator.w3.org stuff

if document is using HTML5 doctype

    send it over to the validator.nu part of the validator to sort out.

Then the validator.nu part of the W3C's validator can determine whether it is HTML5 or XHTML5 based on the MIME type, file extension, etc.
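
The routing proposed above could be sketched in Python roughly as follows. The function name, the backend labels, the `doctype_override` parameter, and the simplified doctype pattern are all assumptions for illustration; the real validator is not implemented this way as far as this thread shows.

```python
import re
from typing import Optional

# Simplified pattern: matches only the plain "<!DOCTYPE html>" form.
HTML5_DOCTYPE = re.compile(r"<!DOCTYPE\s+html\s*>", re.IGNORECASE)


def choose_backend(document: str,
                   doctype_override: Optional[str] = None) -> str:
    """Pick which checker handles the document (hypothetical labels)."""
    # An explicit HTML5 override (as in the doctype=HTML5 URLs above)
    # routes to validator.nu regardless of the document's own doctype.
    if doctype_override == "HTML5":
        return "validator.nu"
    # Otherwise look for the HTML5 doctype near the start of the document.
    if HTML5_DOCTYPE.search(document[:1024]):
        return "validator.nu"
    return "validator.w3.org"
```

Whether the result is then treated as HTML5 or XHTML5 is left to the validator.nu side, as suggested above.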

I was under the impression that this was the way it was set up already.

Of course, there's the issue of XHTML5 without a doctype, but I'll comment on that on the other bug report. :)
Comment 9 Olivier Thereaux 2009-01-03 15:22:55 UTC
(In reply to comment #8)
> I don't see how that is relevant.

If it is so obvious, you could provide an obvious patch?

> Any web page using the new HTML doctype (aka
> the HTML5 doctype "<!DOCTYPE html>") should be passed over to the Validator.nu
> side of the W3C's validator.

Correct.
Comment 10 Olivier Thereaux 2009-01-03 15:28:07 UTC
(In reply to comment #9)
> (In reply to comment #8)
> > I don't see how that is relevant.
> 
> If it is so obvious, you could provide an obvious patch?

forgot a " ;) "

working on a patch, FWIW...
Comment 11 Olivier Thereaux 2009-01-03 15:29:51 UTC
Working on dev now.

http://qa-dev.w3.org/wmvs/HEAD/check?uri=http%3A%2F%2Fwww.aneventapart.com%2F
Comment 12 Dean Edridge 2009-01-03 16:14:23 UTC
(In reply to comment #9)
> (In reply to comment #8)
> > I don't see how that is relevant.
> 
> If it is so obvious, 

Sorry, I wasn't suggesting it was obvious, just trying to come up with an algorithm. I was just a bit concerned over how the validator.w3.org and the validator.nu work together, that's all. :-) but it looks like it's coming together well.

> you could provide an obvious patch?

Well, if you can show me how to set up the validator on Ubuntu or Windows XP (I've tried before, but with mixed results) I'll have a go at writing a patch for the next bug I find :)

Keep up the good work

Comment 13 Ville Skyttä 2009-01-04 20:42:35 UTC
The implementation in CVS used the "textarea input" mode for validator.nu without passing the "parser" parameter, which is against a "should" in upstream docs.  I changed it to use the "HTTP entity body" mode, which solves this problem and also allows us, in non-doctype-override, non-charset-override scenarios, to pass the original document and its content-type and charset intact to validator.nu.  As a side effect, this also allows us to use gzipped requests to validator.nu with libwww-perl which is recommended by upstream docs.

http://www.w3.org/mid/E1LJZgk-0007br-NR%40lionel-hutz.w3.org
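
For readers unfamiliar with the "HTTP entity body" mode, the idea is to POST the original document as the request body with its own Content-Type header. The actual change was made in Perl with libwww-perl; the following is only a Python sketch, and the function name, the default endpoint URL and its `out=json` query parameter are assumptions. It builds the request without sending it.

```python
import gzip
import urllib.request


def build_request(body: bytes,
                  content_type: str,
                  endpoint: str = "https://validator.nu/?out=json",
                  compress: bool = True) -> urllib.request.Request:
    """Build a POST that carries the document as the HTTP entity body."""
    headers = {
        # Pass the original media type and charset through unchanged.
        "Content-Type": content_type,
        # Ask for a compressed response.
        "Accept-Encoding": "gzip",
    }
    if compress:
        # Optionally compress the request body as well.
        body = gzip.compress(body)
        headers["Content-Encoding"] = "gzip"
    return urllib.request.Request(endpoint, data=body,
                                  headers=headers, method="POST")
```

Because the original Content-Type travels with the document, the receiving service can choose between its HTML and XML parsers itself, so no separate "parser" parameter is needed in this mode.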
Comment 14 Olivier Thereaux 2009-01-05 12:44:55 UTC
(In reply to comment #13)
> The implementation in CVS used the "textarea input" mode for validator.nu
> without passing the "parser" parameter, which is against a "should" in upstream
> docs.  I changed it to use the "HTTP entity body" mode 

Good catch!
Comment 15 Ville Skyttä 2009-01-05 16:11:07 UTC
Thanks.  One correction, just for the record:

(In reply to comment #13)
> gzipped requests to validator.nu with libwww-perl which is recommended by
> upstream

I seem to have confused this with the upstream recommendation to take advantage of compressed _responses_; the docs are more neutral about request compression.  Well, we have both in CVS now anyway :)