1762 – UTF-8 BOM in XHTML breaks CSS validator

This is an archived snapshot of W3C's public Bugzilla bug tracker, decommissioned in April 2019.

Summary: UTF-8 BOM in XHTML breaks CSS validator

Status:	RESOLVED FIXED

Alias:	None

Product:	CSSValidator
Classification:	Unclassified
Component:	XHTML1.0
Version:	CSS Validator
Hardware:	PC Windows XP

Importance:	P2 normal
Target Milestone:	---
Assignee:	Olivier Thereaux
QA Contact:	qa-dev tracking

URL:	http://www.w3.org/International/tests...
Whiteboard:
Keywords:

Depends on:
Blocks:

Reported:	2005-07-19 22:58 UTC by Cory Nelson
Modified:	2007-06-28 00:43 UTC
CC List:	0 users

See Also:

Attachments

Description Cory Nelson 2005-07-19 22:58:50 UTC

A UTF-8 BOM at the start of an XHTML document breaks the CSS validator. To
reproduce, follow the "Valid CSS" link at the bottom of the URL provided.

Comment 1 Bj 2005-07-19 23:11:20 UTC

Indeed (in fact, that's probably a known issue, but I am not sure whether 
someone filed a bug already). We might be able to fix this by upgrading to a 
more recent version of Xerces.

Comment 2 Yves Lafon 2005-07-20 10:10:55 UTC

The Xerces version is currently 2.6.2. Can you check again?

Comment 3 Bj 2005-07-20 12:05:41 UTC

Okay, it seems this happens with Content-Type: text/html, no charset parameter, 
and a BOM. So this is probably the result of how the HTML parser, with its XHTML 
sniffing, interacts with Xerces. The Validator might be transcoding to UTF-8 
before it passes the document to Xerces, and in a character stream a BOM may 
indeed not appear. It seems to work for application/xhtml+xml and for text/html 
with a charset parameter in the HTTP header.
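
To illustrate why a BOM can end up as stray characters in front of the XML declaration, here is a minimal Python sketch. It is not the validator's code (the validator is a Java application); the payload is a made-up example, and the only claim is about how the same bytes decode under different charsets.

```python
import codecs

# Hypothetical XHTML payload that starts with the UTF-8 BOM (bytes EF BB BF).
payload = codecs.BOM_UTF8 + b'<?xml version="1.0" encoding="utf-8"?>\n<html/>'

# Decoded as ISO-8859-1, the three BOM bytes become three visible characters.
as_latin1 = payload.decode("iso-8859-1")

# Decoded as plain UTF-8, the BOM survives as the single character U+FEFF.
as_utf8 = payload.decode("utf-8")

# Python's "utf-8-sig" codec drops a leading BOM while decoding.
as_utf8_sig = payload.decode("utf-8-sig")

print(repr(as_latin1[:3]))                # the three characters "ï»¿"
print(repr(as_utf8[0]))                   # '\ufeff'
print(as_utf8_sig.startswith("<?xml"))    # True
```

In the first two cases an XML parser that receives the decoded text sees something other than `<?xml` at the very start, which is consistent with the failure described in this comment.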

Comment 4 Cory Nelson 2005-07-20 13:14:12 UTC

That did it: I declared the charset as UTF-8 in the HTTP header and it now works.

Comment 5 Yves Lafon 2005-07-22 10:13:10 UTC

(In reply to comment #3)
> Okay, it seems this happens with Content-Type: text/html, no charset parameter, 
> and a BOM. So this is probably the result of how the HTML parser, with its XHTML 
> sniffing, interacts with Xerces. The Validator might be transcoding to UTF-8 
> before it passes the document to Xerces, and in a character stream a BOM may 
> indeed not appear. It seems to work for application/xhtml+xml and for text/html 
> with a charset parameter in the HTTP header.

The current code does this:
if the MIME type has a charset parameter, use it;
if not, and the MIME type is text/html, use ISO-8859-1.
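
The rule above can be sketched in Python as follows. This is only an illustration of the described logic, not the validator's actual Java code; the function name `choose_charset` is hypothetical, and returning `None` for other media types is an assumption, since the comment does not say what happens in that case.

```python
def choose_charset(content_type):
    """Pick a charset from a Content-Type header value (illustrative only)."""
    media_type, _, params = content_type.partition(";")

    # Rule 1: an explicit charset parameter wins.
    for param in params.split(";"):
        name, _, value = param.partition("=")
        if name.strip().lower() == "charset" and value.strip():
            return value.strip().strip('"').lower()

    # Rule 2: text/html without a charset falls back to ISO-8859-1.
    if media_type.strip().lower() == "text/html":
        return "iso-8859-1"

    # Assumption: the source does not describe this case.
    return None


print(choose_charset("text/html; charset=utf-8"))  # utf-8
print(choose_charset("text/html"))                 # iso-8859-1
print(choose_charset("application/xhtml+xml"))     # None
```

Under this rule a BOM-prefixed UTF-8 document served as plain `text/html` is read as ISO-8859-1, so the BOM bytes are not recognized as a BOM.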

Comment 6 Olivier Thereaux 2007-02-09 17:45:50 UTC

Changing the URL to point to the test case on the i18n web site.

Comment 7 Olivier Thereaux 2007-06-28 00:43:13 UTC

Switching to the TagSoup library as the HTML parser has made this issue moot.
(There are still issues with BOM-toting CSS files, but I will open another bug for them.)
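
For the remaining BOM-toting CSS case, one common approach is to drop a leading UTF-8 BOM from the raw bytes before they reach the parser. The Python sketch below shows the idea only; `strip_utf8_bom` is a hypothetical helper, and nothing here is taken from the validator's source.

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 BOM, if one is present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


# Hypothetical stylesheet bytes, with and without a BOM.
with_bom = codecs.BOM_UTF8 + b"body { color: black }"
without_bom = b"body { color: black }"

print(strip_utf8_bom(with_bom) == without_bom)     # True
print(strip_utf8_bom(without_bom) == without_bom)  # True
```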