This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019. Please see the home page for more details.

Bug 4867 - override encoding info in XML prolog to reflect transcoding
Summary: override encoding info in XML prolog to reflect transcoding
Alias: None
Product: Validator
Classification: Unclassified
Component: Parser
Version: 0.8.0b2
Hardware: PC Windows XP
Importance: P2 major
Target Milestone: 0.8.0
Assignee: This bug has no owner yet - up for the taking
QA Contact: qa-dev tracking
Depends on:
Reported: 2007-07-19 01:00 UTC by Masataka Yakura
Modified: 2007-07-19 06:22 UTC

Description Masataka Yakura 2007-07-19 01:00:13 UTC
There seems to be a bug in the new XML parser. It doesn't recognize some Japanese encodings other than UTF, such as Shift_JIS and EUC-JP.

Try validating , and you'll see some XML errors. But when I saved the page in an XML format (mitsue.xml) and opened it in Firefox and Internet Explorer, I got no such errors. Rewrite the source, replacing "shift_jis" with "UTF-8", and it will validate. Thus, the validator seems to have some encoding detection and handling issues.

There are a great many web pages encoded in Shift_JIS, EUC-JP, or other non-UTF encodings. I'm afraid that launching the new validator without fixing this issue would cause serious confusion in the Japanese market.
Comment 1 Olivier Thereaux 2007-07-19 05:06:08 UTC
Nice catch Masataka, thanks a lot.

I found out that the problem was with 
<?xml version="1.0" encoding="Shift_JIS"?>
which causes the XML parser to read the XML content as Shift_JIS, even though the validator systematically transcodes everything to UTF-8 before passing it to the different parsers.

I'm looking at whether I can tell the XML parser to ignore the encoding="..." declaration or whether I should rewrite its value to UTF-8.
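
[Editor's illustration, not part of the original comment] The mismatch described above can be reproduced in a few lines of Python. This is only a sketch: the sample document is hypothetical, and the validator's transcoding step is modelled here as a plain encode to UTF-8 rather than the validator's actual code.

```python
# Hypothetical sample document; only the XML declaration is taken from the report.
document = '<?xml version="1.0" encoding="Shift_JIS"?>\n<p>日本語</p>'

# Step 1: the transcoding step, modelled as encoding the text to UTF-8 bytes.
utf8_bytes = document.encode("utf-8")

# Step 2: a parser that trusts the stale declaration decodes those bytes as Shift_JIS.
# errors="replace" keeps going past invalid byte sequences instead of raising.
misread = utf8_bytes.decode("shift_jis", errors="replace")

# Decoding with the encoding the bytes really use round-trips cleanly.
correct = utf8_bytes.decode("utf-8")

print(misread == document)  # the Japanese text is garbled, so this is False
print(correct == document)  # this is True
```

The ASCII part of the document survives either way, which is why the declaration itself is still readable and the parser goes on to misinterpret only the non-ASCII content.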
Comment 2 Olivier Thereaux 2007-07-19 06:22:23 UTC
Fixed with a regexp, which should cover pretty much all reasonable cases.
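
[Editor's illustration, not part of the original comment] The regexp actually committed is not shown in this report. A minimal sketch of the general idea, assuming the document has already been transcoded to UTF-8 and is held as text, might look like the following. The names `XML_DECL_ENCODING` and `override_declared_encoding` are hypothetical, and the pattern only handles a declaration at the very start of the document with `version` followed by `encoding`.

```python
import re

# Matches the encoding value inside an XML declaration at the start of the text.
XML_DECL_ENCODING = re.compile(
    r"""
    \A(                      # group 1: everything up to the encoding value
        <\?xml\s+version\s*=\s*(?:"[^"]*"|'[^']*')
        \s+encoding\s*=\s*
    )
    (["'])                   # group 2: opening quote
    [A-Za-z][A-Za-z0-9._-]*  # declared encoding name
    \2                       # matching closing quote
    """,
    re.VERBOSE,
)


def override_declared_encoding(text: str, new_encoding: str = "UTF-8") -> str:
    """Rewrite the declared encoding so it reflects the transcoded content."""
    return XML_DECL_ENCODING.sub(
        # A function replacement avoids backslash handling in the template.
        lambda m: f"{m.group(1)}{m.group(2)}{new_encoding}{m.group(2)}",
        text,
        count=1,
    )


source = '<?xml version="1.0" encoding="Shift_JIS"?>\n<root/>'
print(override_declared_encoding(source))
```

A document without an XML declaration, or with a declaration that has no encoding pseudo-attribute, passes through unchanged, which is the desired behaviour since there is then nothing stale to override.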