Bug 690: Unusual processing instructions confuse the validator

Created: 2004-04-26 07:43:09 +0000
Last modified: 2004-05-16 08:21:39 +0000
Classification: Unclassified
Product: Validator
Component: check
Version: 0.6.1
Platform: PC
OS: Linux
Status: RESOLVED LATER
URL: http://www.ltg.ed.ac.uk/~ht/xx.html
Priority: P5
Severity: minor
Target milestone: 1.0
Reporter: ht
Assigned to: link
Export flags: reporter_accessible 1, cclist_accessible 1, classification_id 1, everconfirmed 1, comment_sort_order oldest_to_newest

Comment 0 (id 1740), ht, 2004-04-26 07:43:09 +0000

This document is valid per SP 1.3.4, but the online validator complains that it can't detect an encoding. I suspect the charset sniffer is not getting past the PIs at the beginning: if I remove them, it's happy.

[Note: I have no idea what version I'm using. Neither the form nor the result page gives a version number that I can see. Sorry if I'm missing something obvious.]

Comment 1 (id 1771), link, 2004-04-26 23:59:50 +0000

That URL is 404 Compliant.

Comment 2 (id 1772), ht, 2004-04-27 04:03:23 +0000

Sorry, my screw-up; it's in place now.

Comment 3 (id 1836), link, 2004-05-16 04:21:39 +0000

I think this is a case of "Don't Do That Then". The charset sniffer is groping around for a <meta> charset because there isn't one in the Content-Type (Strike #1), and it fails to find one because at this stage we're using a non-SGML parser (Perl's HTML::Parser), which can't handle weird constructs that come before the information it's after (Strike #2).

The only way setting encoding info inside the document can ever work is if you take extreme care to make the bytes up to that point easy to parse. This includes avoiding any non-vanilla constructs and making sure that whatever encoding the document is in looks identical to US-ASCII up to that point.

I don't think we're going to be able to fix this without either some fairly elaborate digging inside HTML::Parser's guts or using OpenSP and doing a two-pass parse. Given the overhead and the low gain, I'm not sure it's worth it for this bug alone (but it's another point in favour of doing a two-pass parse).

Resolving as "LATER", and setting Target to 1.0 (aka "Once upon a time...").

Thanks for the catch, Henry!
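
For anyone following the lookup order described in the last comment (charset from the HTTP Content-Type first, then a <meta> charset near the top of the document), here is a minimal Python sketch. It is not the validator's code, which is Perl and uses HTML::Parser. The function name sniff_charset, the 4096-byte limit, and the regular expressions are all assumptions made for illustration. It also assumes XML-style processing instructions ("<?" up to "?>") and a document whose leading bytes are US-ASCII compatible.

```python
import re

# All names below are hypothetical; this only illustrates the lookup order.
_CT_CHARSET = re.compile(r'charset\s*=\s*"?([A-Za-z0-9._:-]+)"?', re.I)
_PI = re.compile(rb'<\?.*?\?>', re.S)  # assumes XML-style PIs ending in "?>"
_META = re.compile(
    rb'<meta\b[^>]*?charset\s*=\s*["\']?([A-Za-z0-9._:-]+)', re.I
)


def sniff_charset(content_type, body, limit=4096):
    """Return (charset, source), or (None, None) if nothing was found."""
    # 1. A charset parameter in the HTTP Content-Type header wins.
    if content_type:
        m = _CT_CHARSET.search(content_type)
        if m:
            return m.group(1).lower(), "http"

    # 2. Otherwise scan the first `limit` bytes of the body. This only
    #    works if those bytes look like US-ASCII, as the comment notes.
    head = body[:limit]

    # Drop processing instructions first, so that text inside a PI can
    # never be mistaken for a <meta> element.
    head = _PI.sub(b"", head)

    m = _META.search(head)
    if m:
        return m.group(1).decode("ascii").lower(), "meta"
    return None, None


if __name__ == "__main__":
    doc = (
        b'<?xml version="1.0"?>\n'
        b'<?xml-stylesheet href="x.css" type="text/css"?>\n'
        b'<html><head><meta http-equiv="Content-Type" '
        b'content="text/html; charset=ISO-8859-1"></head></html>'
    )
    print(sniff_charset("text/html", doc))
    print(sniff_charset("text/html; charset=UTF-8", doc))
```

A byte-level scan like this is cruder than a real parse. It would not stand in for the two-pass OpenSP approach mentioned above, but it shows why leading PIs need not be fatal when the sniffer does not depend on a tag-level parser at that stage.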