This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019.

Bug 1762 - UTF-8 BOM in XHTML breaks CSS validator
Summary: UTF-8 BOM in XHTML breaks CSS validator
Status: RESOLVED FIXED
Alias: None
Product: CSSValidator
Classification: Unclassified
Component: XHTML1.0
Version: CSS Validator
Hardware: PC Windows XP
Importance: P2 normal
Target Milestone: ---
Assignee: Olivier Thereaux
QA Contact: qa-dev tracking
URL: http://www.w3.org/International/tests...
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2005-07-19 22:58 UTC by Cory Nelson
Modified: 2007-06-28 00:43 UTC
0 users

See Also:


Attachments

Description Cory Nelson 2005-07-19 22:58:50 UTC
A UTF-8 BOM in XHTML breaks the CSS validator; see the "Valid CSS" link at the
bottom of the URL provided.
Comment 1 Bj 2005-07-19 23:11:20 UTC
Indeed (in fact, that's probably a known issue, but I am not sure whether 
someone filed a bug already). We might be able to fix this by upgrading to a 
more recent version of Xerces.
Comment 2 Yves Lafon 2005-07-20 10:10:55 UTC
The Xerces version is currently 2.6.2; can you check again?
Comment 3 Bj 2005-07-20 12:05:41 UTC
Okay, it seems this happens with Content-Type: text/html, no charset parameter, 
and a BOM. So this is probably the result of how the HTML parser, with its XHTML 
sniffing, interacts with Xerces. The Validator might be transcoding to UTF-8 
before it passes the document to Xerces, and in a character stream a BOM may 
indeed not appear. It seems to work for application/xhtml+xml and for text/html 
with a charset parameter in the HTTP header.
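For reference, a UTF-8 BOM is the three-byte sequence EF BB BF at the very start of the response body. The following is a minimal sketch of how one might check for it on the raw bytes before any transcoding; the class and method names are hypothetical and are not taken from the validator's source.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; not part of the CSS Validator.
final class BomSniffer {
    // U+FEFF encoded as UTF-8 is the byte sequence EF BB BF.
    private static final byte[] UTF8_BOM = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF};

    /** Returns true if the raw (undecoded) body starts with a UTF-8 BOM. */
    static boolean startsWithUtf8Bom(byte[] body) {
        if (body.length < UTF8_BOM.length) {
            return false;
        }
        for (int i = 0; i < UTF8_BOM.length; i++) {
            if (body[i] != UTF8_BOM[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Two sample bodies: one prefixed with U+FEFF, one without.
        byte[] withBom = "\uFEFF<?xml version=\"1.0\"?>".getBytes(StandardCharsets.UTF_8);
        byte[] withoutBom = "<?xml version=\"1.0\"?>".getBytes(StandardCharsets.UTF_8);
        System.out.println("with BOM: " + startsWithUtf8Bom(withBom));
        System.out.println("without BOM: " + startsWithUtf8Bom(withoutBom));
    }
}
```

The check has to run on bytes, not on already-decoded characters, because the outcome of decoding depends on which charset was assumed.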
Comment 4 Cory Nelson 2005-07-20 13:14:12 UTC
That did it: I declared the charset as UTF-8 in the HTTP header and it now works.
Comment 5 Yves Lafon 2005-07-22 10:13:10 UTC
(In reply to comment #3)
> Okay, it seems this happens with Content-Type: text/html, no charset parameter, 
> and a BOM. So this is probably the result of how the HTML parser, with its XHTML 
> sniffing, interacts with Xerces. The Validator might be transcoding to UTF-8 
> before it passes the document to Xerces, and in a character stream a BOM may 
> indeed not appear. It seems to work for application/xhtml+xml and for text/html 
> with a charset parameter in the HTTP header.

The current code does this:
if the MIME type has a charset parameter, use it;
if not, and the MIME type is text/html, use ISO-8859-1.
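The rule described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the validator's actual code: the class and method names are hypothetical, and the UTF-8 default for other MIME types is an assumption, since the comment does not say what happens in that case. The main method shows one plausible reading of the failure: when BOM-prefixed UTF-8 bytes are decoded as ISO-8859-1, the BOM turns into three ordinary characters at the start of the document.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch of the charset selection described in the comment above.
final class CharsetFallback {
    /**
     * @param mimeType     the media type from Content-Type, e.g. "text/html"
     * @param charsetParam the charset parameter from Content-Type, or null if absent
     */
    static Charset choose(String mimeType, String charsetParam) {
        if (charsetParam != null) {
            return Charset.forName(charsetParam);
        }
        if ("text/html".equalsIgnoreCase(mimeType)) {
            return StandardCharsets.ISO_8859_1;
        }
        // Assumption: the comment does not cover other types; UTF-8 is used here.
        return StandardCharsets.UTF_8;
    }

    public static void main(String[] args) {
        byte[] body = "\uFEFF<html>".getBytes(StandardCharsets.UTF_8);

        // text/html without a charset parameter falls back to ISO-8859-1,
        // so the bytes EF BB BF decode to U+00EF, U+00BB, U+00BF.
        String decoded = new String(body, choose("text/html", null));
        for (int i = 0; i < 3; i++) {
            System.out.printf("U+%04X%n", (int) decoded.charAt(i));
        }
    }
}
```

With an explicit charset parameter such as utf-8, the first branch is taken and the ISO-8859-1 fallback never applies, which is consistent with the workaround reported in comment 4.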
Comment 6 Olivier Thereaux 2007-02-09 17:45:50 UTC
Changing the URL to point to a test case on the i18n web site.
Comment 7 Olivier Thereaux 2007-06-28 00:43:13 UTC
Switching to the TagSoup library as the HTML parser has made this issue moot.
(There are still issues with BOM-toting CSS files, but I will open another bug for them.)