This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019.

Bug 690 - Unusual processing instructions confuse the validator
Summary: Unusual processing instructions confuse the validator
Status: RESOLVED LATER
Alias: None
Product: Validator
Classification: Unclassified
Component: check
Version: 0.6.1
Hardware: PC Linux
Importance: P5 minor
Target Milestone: 1.0
Assignee: Terje Bless
QA Contact:
URL: http://www.ltg.ed.ac.uk/~ht/xx.html
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2004-04-26 07:43 UTC by Henry S. Thompson
Modified: 2004-05-16 08:21 UTC
CC List: 0 users

See Also:


Attachments

Description Henry S. Thompson 2004-04-26 07:43:09 UTC
This document is valid per SP 1.3.4, but the online validator complains that it
can't detect an encoding.

I suspect the charset sniffer is not getting past the processing instructions
(PIs) at the beginning: if I remove them, it's happy.

[Note: I have no idea what version I'm using. Neither the form nor the result
page gives a version number that I can see. Sorry if I'm missing something
obvious.]
Comment 1 Terje Bless 2004-04-26 23:59:50 UTC
That URL is 404 Compliant.
Comment 2 Henry S. Thompson 2004-04-27 04:03:23 UTC
Sorry, my screw-up; it's in place now.
Comment 3 Terje Bless 2004-05-16 04:21:39 UTC
I think this is a case of "Don't Do That Then". The charset sniffer is groping
around for a <meta> charset because there isn't one in the Content-Type (Strike
#1), and fails to find it because at this stage we're using a non-SGML parser
(Perl's HTML::Parser) which can't handle weird constructs prior to the
information it's after (Strike #2).

The only way setting encoding info inside the document can ever work is if you
take extreme care to make the bytes up to that point be easily parsed. This
includes avoiding any non-vanilla constructs and making sure whatever encoding
it's in looks identical to US-ASCII up to that point.
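
To make the constraint above concrete, here is a minimal sketch in Python. It is not the validator's actual sniffer (that is Perl and goes through HTML::Parser); the function name, the regular expression, and the 4096-byte limit are all assumptions made for illustration. It only trusts a <meta> charset if every byte before the declaration is plain US-ASCII.

```python
import re

# Hypothetical pattern: finds "charset=..." inside a <meta> tag in raw bytes.
META_CHARSET = re.compile(
    rb"<meta[^>]*charset\s*=\s*[\"']?\s*([A-Za-z0-9._:-]+)", re.IGNORECASE
)


def sniff_meta_charset(data, limit=4096):
    """Return the charset named in a <meta> tag, or None if it can't be trusted."""
    head = data[:limit]  # only look at the start of the document
    match = META_CHARSET.search(head)
    if match is None:
        return None
    prefix = head[:match.start()]
    # Trust the declaration only if everything before it is plain US-ASCII.
    if any(byte > 0x7F for byte in prefix):
        return None
    return match.group(1).decode("ascii").lower()
```

A raw byte scan like this sidesteps tag parsing altogether, so leading PIs don't trip it up, but it will also match a <meta> that sits inside a comment, and it sees nothing in a UTF-16 document, where the markup doesn't look like US-ASCII at all.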


I don't think we're going to be able to fix this without either some fairly
elaborate digging inside HTML::Parser's guts or switching to OpenSP and doing a
two-pass parse. Given the overhead and the low gain, I'm not sure it's worth it
for this bug alone (but it's another point in favour of doing a two-pass parse).
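
For illustration only, the two-pass idea might look roughly like the sketch below. It uses Python's standard html.parser as a stand-in for OpenSP, so it says nothing about how OpenSP itself would behave; the class name, function name, and UTF-8 fallback are assumptions. The first pass decodes the bytes as Latin-1 (which never fails and leaves ASCII markup intact) purely to find the <meta> charset, and the second pass decodes the document for real.

```python
from html.parser import HTMLParser


class MetaCharsetFinder(HTMLParser):
    """First pass: record the first charset declared in a <meta> tag."""

    def __init__(self):
        super().__init__()
        self.charset = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.charset is not None:
            return
        attrs = dict(attrs)
        if attrs.get("charset"):  # <meta charset="...">
            self.charset = attrs["charset"].strip().lower()
            return
        if (attrs.get("http-equiv") or "").lower() != "content-type":
            return
        # <meta http-equiv="Content-Type" content="text/html; charset=...">
        for part in (attrs.get("content") or "").split(";"):
            name, _, value = part.partition("=")
            if name.strip().lower() == "charset" and value.strip():
                self.charset = value.strip().lower()


def two_pass_decode(data, fallback="utf-8"):
    """Return (charset, text); the fallback charset is an assumption."""
    finder = MetaCharsetFinder()
    finder.feed(data.decode("latin-1"))  # pass 1: byte-preserving decode
    finder.close()
    charset = finder.charset or fallback
    return charset, data.decode(charset, errors="replace")  # pass 2
```

The cost is the one noted above: every document without a charset in its Content-Type gets parsed twice.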

Resolving as "LATER", and setting Target to 1.0 (aka. "Once upon a time...").

Thanks for the catch, Henry!