24 – HTML::Parser in XML mode doesn't work with lowercase doctypes

This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019. Please see the home page for more details.

Summary: HTML::Parser in XML mode doesn't work with lowercase doctypes

Status:	RESOLVED FIXED

Alias:	None

Product:	Validator
Classification:	Unclassified
Component:	Parser
Version:	0.6.0b1
Hardware:	PC Linux

Importance:	P2 normal
Target Milestone:	1.0
Assignee:	Ville Skyttä
QA Contact:	qa-dev tracking

URL:	http://www.w3.org/Style/CSS/learning
Whiteboard:
Keywords:

Duplicates (2):	904 1280
Depends on:
Blocks:	14

Reported:	2002-10-25 19:41 UTC by Ville Skyttä
Modified:	2008-12-02 22:42 UTC
CC List:	3 users

See Also:

Attachments

Description Ville Skyttä 2002-10-25 19:41:30 UTC

We're unconditionally setting HTML::Parser into xml_mode() in preparse(). 
Seemingly this means that the doctype declaration isn't recognized in this mode
if it's in lowercase.

I don't think this is an HTML::Parser bug; the whole doctype declaration *is*
case sensitive in XML, whereas in SGML e.g. the "DOCTYPE" and "PUBLIC" keywords
are perfectly legal in lowercase.
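
To illustrate the difference described above, here is a minimal Python sketch. It is not HTML::Parser and not validator code; `find_doctype` is a hypothetical helper that treats the `DOCTYPE` keyword as case sensitive under XML rules and case insensitive under SGML rules. It deliberately ignores internal subsets and other declaration details.

```python
import re

# Hypothetical patterns: the same expression, with and without case folding.
# Simplification: a declaration is assumed to end at the first ">".
_DOCTYPE_XML = re.compile(r"<!DOCTYPE\s+[^>]*>")
_DOCTYPE_SGML = re.compile(r"<!DOCTYPE\s+[^>]*>", re.IGNORECASE)


def find_doctype(markup, xml_mode):
    """Return the doctype declaration text, or None if none is recognized."""
    pattern = _DOCTYPE_XML if xml_mode else _DOCTYPE_SGML
    match = pattern.search(markup)
    return match.group(0) if match else None


sample = '<!doctype html public "-//W3C//DTD HTML 4.01//EN">\n<title>t</title>'
print(find_doctype(sample, xml_mode=True))   # lowercase keyword is not recognized
print(find_doctype(sample, xml_mode=False))  # the declaration is found
```

With XML rules the lowercase declaration goes unrecognized, which matches the symptom reported here: validation still runs, but the document type is unknown to the result page.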

AFAICT, this won't affect the validation, but the "valid" response page doesn't
know the document type, hence reporting only "This page is Valid!".

Perhaps preparse() should take the already available information (Content-Type,
possibly the XML declaration) into account when setting the mode of HTML::Parser.

Oh, and this is with HTML::Parser 3.26.  And see the URL for a test case.

Comment 1 Terje Bless 2002-10-27 10:50:09 UTC

I see no easy way to achieve this right now. Switching xml_mode() on and off
based on available information at that point involves too much guesswork. I
think this is a perfect case for documenting it as a known limitation and moving on.

Comment 2 Terje Bless 2002-10-27 11:09:02 UTC

Reset target to 1.0 (aka. "Blue Sky" ;D) and reassigning to Ville.

I see no sane way to do it, but perhaps you do? If not, either leave the bug
open in the hopes that we'll find a way later, or close it.

Comment 3 Ville Skyttä 2002-10-27 11:34:22 UTC

I can't come up with a robust way to do it either at the moment.  But I haven't
quite given up yet, leaving open...

Comment 4 Bj 2004-10-08 20:44:31 UTC

*** Bug 904 has been marked as a duplicate of this bug. ***

Comment 5 Ville Skyttä 2005-07-20 13:51:21 UTC

*** Bug 1280 has been marked as a duplicate of this bug. ***

Comment 6 Sascha Wilde 2006-01-06 11:37:28 UTC

Why was this closed as "INVALID"?

The bug is still not solved and there was no explanation, why the
reported bug was considered "invalid".

Comment 7 Olivier Thereaux 2006-01-06 11:56:33 UTC

(In reply to comment #6)
> Why was this closed as "INVALID"?

Because someone is having a lot of fun today messing up the state of bugs in this database.

Comment 8 Olivier Thereaux 2006-08-30 07:26:22 UTC

Discussions on Bug #1500 and others seem to point to one solution:
- if the content-type is text/html, then set HTML::Parser to HTML mode, without xml_mode()
- if the content-type is anything else (as far as I can tell, all other document types supported by the validator are xml-based), then set HTML::Parser in xml_mode()

The current code for check goes:
1) set temporary parse mode to SGML, XML or TBD based on content-type
( check v 1.432.2.11; lines 188-193 )
2) run $File = &preparse_doctype($File); in systematic XML mode.
( check v 1.432.2.11; line 526 )
3) if parse mode is still TBD, based on doctype found, set parse mode to SGML or XML
( check v 1.432.2.11; lines 530-560 )

If I understand the proposal of Bug #1500, the new behavior would be
1) set parse mode to SGML or XML based on content-type. text/html systematically means SGML mode
2) run $File = &preparse_doctype($File); using detected parse mode as mode for HTML::Parser and subsequent doctype detection.

If the above is correct, then it seems fixing this bug (Bug #24) is almost immediate once Bug #1500 is resolved. 

Am I missing something?
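
A rough Python sketch of step 1 of the proposed behavior, assuming the rule is exactly "text/html means SGML, anything else means XML". The function name `proposed_parse_mode` is hypothetical, and the actual `check` script is Perl, so this only illustrates the decision, not the implementation.

```python
def proposed_parse_mode(content_type):
    """Map a Content-Type header value to "SGML" or "XML".

    Assumption: parameters such as "; charset=utf-8" are ignored and the
    media type is compared case-insensitively.
    """
    media_type = content_type.split(";", 1)[0].strip().lower()
    return "SGML" if media_type == "text/html" else "XML"


print(proposed_parse_mode("text/html; charset=utf-8"))
print(proposed_parse_mode("application/xhtml+xml"))
```

The resulting mode would then be passed on to the doctype preparse in step 2, instead of always preparsing in XML mode.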

Comment 9 Ville Skyttä 2006-08-30 15:43:10 UTC

I guess relying on Content-Type alone for the SGML/XML mode decision may not work too well with uploaded documents and/or markup submitted through a textarea.  

Maybe require the user to set the mode in those cases?  This would however have an effect on the CGI "API" (think 3rd party tools submitting docs to the validator).

Comment 10 Olivier Thereaux 2006-08-31 01:58:27 UTC

(In reply to comment #9)
> I guess relying on Content-Type alone for the SGML/XML mode decision may not
> work too well with uploaded documents and/or markup submitted through a
> textarea.  
> Maybe require the user to set the mode in those cases?  

Upload should not be much of a problem, should it? Only for .html documents which would probably be treated as html even if they are XHTML 1.0, but it may just push more people to use .xhtml and serve as application/xhtml+xml.

Direct input is an issue indeed, however, and I suppose we could have a "treat as HTML | treat as XML" radio choice. Similar in a way to what valet does, but probably more accessible to the layman than a choice of parsers.

> This would however have
> an effect on the CGI "API" (think 3rd party tools 
> submitting docs to the validator).

True. That means we need a default for the direct input mode.

Comment 11 Ville Skyttä 2006-08-31 20:43:25 UTC

Hm, if .xhtml uploads as application/xhtml+xml with recent browsers, then the situation is better than I thought.  I guess not a lot of people use the .xhtml extension at the moment, FWIW.

Regarding the API change, I generally tend to think that prominently breaking things is better than subtle, hidden, incompatible behaviour changes.  This change would fall into a grayish area though, maybe having a documented default would be acceptable, even if it could cause different results for the same submission before and after the change.

Just thinking aloud, no strong opinions.

Comment 12 Olivier Thereaux 2006-09-07 02:16:43 UTC

(In reply to comment #11)
> Hm, if .xhtml uploads as application/xhtml+xml with recent browsers, then the
> situation is better than I thought.  I guess not a lot of people use the .xhtml
> extension at the moment, FWIW.

I assumed so, but it is worth testing.


> Regarding the API change, I generally tend to think that prominently breaking
> things is better than subtle, hidden, incompatible behaviour changes.

Good point. Whatever we do, we should try and talk directly to a number of known users of the direct input (opera, mozilla's web developer extension, etc).

Comment 13 Si^mon Pi^eters 2007-01-11 05:40:48 UTC

(In reply to comment #10)
> > This would however have
> > an effect on the CGI "API" (think 3rd party tools 
> > submitting docs to the validator).
> 
> True. That means we need a default for the direct input mode.

I think in the absence of media type information, the validator should either (1) force the user to choose one before performing validation, or (2) first issue a warning about lack of media type information and then check if the character stream begins with (ignoring BOM) "<?xml", and, if so, use XML mode, otherwise SGML mode. In case of (2) the user should be able to change the mode and revalidate.

In either case it should not just say "This is valid X!" without warnings in the case of absent media type information, because it might not be true.
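
Option (2) could look roughly like the following Python sketch. The name `sniff_parse_mode` and the warning text are made up for illustration; the input is assumed to be an already decoded character stream, so the BOM appears as U+FEFF.

```python
def sniff_parse_mode(text):
    """Guess a parse mode when no media type is available.

    Returns a (mode, warnings) pair so the caller can always show the
    warning and let the user override the guess and revalidate.
    """
    warnings = ["No media type information available; parse mode was guessed."]
    if text.startswith("\ufeff"):
        text = text[1:]  # ignore a leading byte order mark
    mode = "XML" if text.startswith("<?xml") else "SGML"
    return mode, warnings


print(sniff_parse_mode("\ufeff<?xml version='1.0'?><html/>"))
print(sniff_parse_mode("<!doctype html public '-//W3C//DTD HTML 4.01//EN'>"))
```

Returning the warning alongside the mode keeps the result page from claiming validity without mentioning that the mode was a guess.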

Comment 14 Olivier Thereaux 2007-03-27 05:26:21 UTC

I think this precise bug was fixed by changing the behavior to the following:

* read content type, if any
* if content type gives parse mode of TBD, or no content type, use HTML::Parser in SGML mode
* else, preparse as XML.

(that is, switch the default to SGML mode, and use XML mode only when we are sure)
This new default will probably cause some issues with direct input (but that's a question for Bug #1391) and some custom DTDs, but overall this behavior is saner.
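
As a sketch of that default, in Python rather than the validator's Perl: the function name `preparse_mode` is hypothetical, and it reads the parenthetical above as "XML only when the content type positively says XML", which is an interpretation rather than something the comment spells out.

```python
def preparse_mode(content_type_mode):
    """Choose the mode for the doctype preparse.

    content_type_mode is "SGML", "XML", "TBD", or None (no content type).
    Assumption: anything other than a definite "XML" falls back to SGML.
    """
    return "XML" if content_type_mode == "XML" else "SGML"


for mode in ("SGML", "XML", "TBD", None):
    print(mode, "->", preparse_mode(mode))
```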