This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019.

Bug 14680 - Using windows-1252 instead of the declared encoding iso-8859-1
Status: RESOLVED WORKSFORME
Alias: None
Product: HTML Checker
Classification: Unclassified
Component: General
Version: unspecified
Hardware: PC Windows XP
Importance: P3 normal
Target Milestone: ---
Assignee: Michael[tm] Smith
QA Contact: qa-dev tracking
URL: http://www.lovatasinhala.com/iso-8859...
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2011-11-02 16:26 UTC by matteo
Modified: 2015-08-23 07:07 UTC

Description matteo 2011-11-02 16:26:18 UTC
I'm trying to validate (by URL) a page that is encoded in iso-8859-1 and declared as iso-8859-1. It is HTML5. The encoding is declared properly in the HTTP Content-Type header, and it corresponds both to the encoding actually used and to the one declared in the HTML.

If I choose "detect automatically" as both the character encoding and the doctype, I get this warning:

Using windows-1252 instead of the declared encoding iso-8859-1

First of all it is unclear: Is it telling me that the page is encoded with a different encoding than the one declared, or is it telling me that the validator is using a different encoding to decode it?

In both cases it doesn't make sense. In the first case, it is plain wrong, because the page IS encoded with iso-8859-1.

In the second case, the automatic detection doesn't work properly.

At the top of the validation page, it says "encoding: iso-8859-1" and "doctype: html5", so it looks like the first hypothesis applies, in which case the error message is bogus.


If I manually select iso-8859-1 instead of "detect automatically", the warning disappears and everything looks correct (I do get errors, but that's OK because the page does have errors).




I can't provide a link to the page, but here's the first part of the content:

<!DOCTYPE html>
<html>
<head>
  <meta charset="iso-8859-1">
  <title>XXXX: Est
Comment 1 transoral 2012-08-03 18:32:41 UTC
I have this exact situation. The sample web page (HTML5) is
http://www.lovatasinhala.com/iso-8859-1-l.htm
I went ahead and included windows-1252 as the character encoding in the HTTP header (via the .htaccess file). However, the point is that this page does not have any characters outside iso-8859-1, and yet the validator says it is windows-1252 when it is only iso-8859-1.

As far as I can see, the difference between iso-8859-1 and windows-1252 is that windows-1252 has the following characters in addition to iso-8859-1.
The first column is the single-byte code (hex), the second is the character, and the third is the corresponding Unicode code point.
80 	€ 	20AC 	EURO SIGN
82 	‚ 	201A 	SINGLE LOW-9 QUOTATION MARK
83 	ƒ 	0192 	LATIN SMALL LETTER F WITH HOOK
84 	„ 	201E 	DOUBLE LOW-9 QUOTATION MARK
85 	… 	2026 	HORIZONTAL ELLIPSIS
86 	† 	2020 	DAGGER
87 	‡ 	2021 	DOUBLE DAGGER
88 	ˆ 	02C6 	MODIFIER LETTER CIRCUMFLEX ACCENT
89 	‰ 	2030 	PER MILLE SIGN
8A 	Š 	0160 	LATIN CAPITAL LETTER S WITH CARON
8B 	‹ 	2039 	SINGLE LEFT-POINTING ANGLE QUOTATION MARK
8C 	Œ 	0152 	LATIN CAPITAL LIGATURE OE
8E 	Ž 	017D 	LATIN CAPITAL LETTER Z WITH CARON
91 	‘ 	2018 	LEFT SINGLE QUOTATION MARK
92 	’ 	2019 	RIGHT SINGLE QUOTATION MARK
93 	“ 	201C 	LEFT DOUBLE QUOTATION MARK
94 	” 	201D 	RIGHT DOUBLE QUOTATION MARK
95 	• 	2022 	BULLET
96 	– 	2013 	EN DASH
97 	— 	2014 	EM DASH
98 	˜ 	02DC 	SMALL TILDE
99 	™ 	2122 	TRADE MARK SIGN
9A 	š 	0161 	LATIN SMALL LETTER S WITH CARON
9B 	› 	203A 	SINGLE RIGHT-POINTING ANGLE QUOTATION MARK
9C 	œ 	0153 	LATIN SMALL LIGATURE OE
9E 	ž 	017E 	LATIN SMALL LETTER Z WITH CARON
9F 	Ÿ 	0178 	LATIN CAPITAL LETTER Y WITH DIAERESIS
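
The table above can be checked with a short sketch using Python's built-in codecs. This is only an illustration, not what the validator does: `differing_bytes` is a hypothetical helper name, and Python's windows-1252 codec leaves five bytes (0x81, 0x8D, 0x8F, 0x90, 0x9D) unassigned, so those are skipped here.

```python
def differing_bytes():
    """Map each byte value to (iso-8859-1 char, windows-1252 char) where the decodings differ."""
    diffs = {}
    for value in range(0x100):
        raw = bytes([value])
        latin1 = raw.decode("iso-8859-1")
        try:
            cp1252 = raw.decode("windows-1252")
        except UnicodeDecodeError:
            # Python's codec leaves 0x81, 0x8D, 0x8F, 0x90 and 0x9D unassigned.
            continue
        if latin1 != cp1252:
            diffs[value] = (latin1, cp1252)
    return diffs


diffs = differing_bytes()

# Check that every difference falls in the 0x80-0x9F range listed above.
print(all(0x80 <= value <= 0x9F for value in diffs))

# A page that avoids those bytes decodes identically under either encoding.
sample = "café".encode("iso-8859-1")
print(sample.decode("iso-8859-1") == sample.decode("windows-1252"))
```

So for a page with no bytes in the 0x80-0x9F range, decoding as windows-1252 gives the same text as decoding as iso-8859-1.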
Comment 2 Michael[tm] Smith 2013-04-21 01:34:48 UTC
I believe that the validator behavior here conforms to the requirements in the HTML5 spec. So if you want to discuss those requirements and/or suggest that they be changed, the best places to have that discussion are public-html@w3.org and whatwg@whatwg.org
Comment 3 matteo 2013-04-21 13:10:29 UTC
@Michael[tm] Smith I don't think you read the report carefully.

The page is declared as ISO-8859-1 both in the headers and in the meta tag. Why on earth should it be detected as another encoding?!?!?!?!?

Where does the html5 standard say such an absurdity?
Comment 4 matteo 2013-04-21 13:11:07 UTC
Also note the part about the unclearness of the message.
Comment 5 Ian 'Hixie' Hickson 2013-04-26 23:21:33 UTC
matteo, see http://encoding.spec.whatwg.org/ for context
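
For context, the Encoding Standard linked above lists "iso-8859-1" among the labels for windows-1252, which would explain why a checker following it decodes the page as windows-1252. A minimal sketch of that label lookup, assuming only an excerpt of the label table; `LABELS` and `get_encoding` are hypothetical names, not the validator's code:

```python
# Excerpt only: a few of the labels the Encoding Standard assigns to each encoding.
LABELS = {
    "iso-8859-1": "windows-1252",
    "iso8859-1": "windows-1252",
    "latin1": "windows-1252",
    "l1": "windows-1252",
    "ascii": "windows-1252",
    "us-ascii": "windows-1252",
    "cp1252": "windows-1252",
    "windows-1252": "windows-1252",
    "utf-8": "UTF-8",
    "utf8": "UTF-8",
}

ASCII_WHITESPACE = "\t\n\f\r "


def get_encoding(label):
    """Resolve a declared label to an encoding name, or None if unknown.

    str.lower() is used as a simplification of ASCII case-insensitive matching.
    """
    return LABELS.get(label.strip(ASCII_WHITESPACE).lower())


print(get_encoding("ISO-8859-1"))
```

Under this reading, the warning means the checker used windows-1252 to decode the page, not that the page was served in a different encoding.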