<?xml version="1.0" encoding="UTF-8" standalone="yes" ?>
<!DOCTYPE bugzilla SYSTEM "https://www.w3.org/Bugs/Public/page.cgi?id=bugzilla.dtd">

<bugzilla version="5.0.4"
          urlbase="https://www.w3.org/Bugs/Public/"
          
          maintainer="sysbot+bugzilla@w3.org"
>

    <bug>
          <bug_id>14680</bug_id>
          
          <creation_ts>2011-11-02 16:26:18 +0000</creation_ts>
          <short_desc>Using windows-1252 instead of the declared encoding iso-8859-1</short_desc>
          <delta_ts>2015-08-23 07:07:26 +0000</delta_ts>
          <reporter_accessible>1</reporter_accessible>
          <cclist_accessible>1</cclist_accessible>
          <classification_id>1</classification_id>
          <classification>Unclassified</classification>
          <product>HTML Checker</product>
          <component>General</component>
          <version>unspecified</version>
          <rep_platform>PC</rep_platform>
          <op_sys>Windows XP</op_sys>
          <bug_status>RESOLVED</bug_status>
          <resolution>WORKSFORME</resolution>
          
          
          <bug_file_loc>http://www.lovatasinhala.com/iso-8859-1-l.htm</bug_file_loc>
          <status_whiteboard></status_whiteboard>
          <keywords></keywords>
          <priority>P3</priority>
          <bug_severity>normal</bug_severity>
          <target_milestone>---</target_milestone>
          
          
          <everconfirmed>1</everconfirmed>
          <reporter name="matteo">matteosistisette</reporter>
          <assigned_to name="Michael[tm] Smith">mike+validator</assigned_to>
          <cc>ian</cc>
    
    <cc>mike</cc>
    
    <cc>transoral</cc>
          
          <qa_contact name="qa-dev tracking">www-validator-cvs</qa_contact>

      

      

      

          <comment_sort_order>oldest_to_newest</comment_sort_order>  
          <long_desc isprivate="0" >
    <commentid>59504</commentid>
    <comment_count>0</comment_count>
    <who name="matteo">matteosistisette</who>
    <bug_when>2011-11-02 16:26:18 +0000</bug_when>
    <thetext>I&apos;m trying to validate (by URL) an HTML5 page that is encoded in iso-8859-1 and declared as iso-8859-1. The encoding is declared properly in the HTTP Content-Type header, and it corresponds to both the encoding actually used and the one declared in the HTML.

If I choose &quot;detect automatically&quot; as both the character encoding and the doctype, I get this warning:

Using windows-1252 instead of the declared encoding iso-8859-1

First of all, the message is unclear: is it telling me that the page is encoded with a different encoding than the one declared, or is it telling me that the validator is using a different encoding to decode it?

In both cases it doesn&apos;t make sense. In the first case, it is plain wrong, because the page IS encoded with iso-8859-1.

In the second case, the automatic detection doesn&apos;t work properly.

At the top of the validation page, it says &quot;encoding: iso-8859-1&quot; and &quot;doctype: html5&quot;, so it looks like the first hypothesis applies, in which case the error message is bogus.


If I manually select iso-8859-1 instead of detect automatically, the warning disappears and everything looks correct (I do get errors, but that&apos;s OK because the page does have errors).




I can&apos;t provide a link to the page, but here&apos;s the first part of the content:

&lt;!DOCTYPE html&gt;
&lt;html&gt;
&lt;head&gt;
  &lt;meta charset=&quot;iso-8859-1&quot;&gt;
  &lt;title&gt;XXXX: Est</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>71826</commentid>
    <comment_count>1</comment_count>
    <who name="">transoral</who>
    <bug_when>2012-08-03 18:32:41 +0000</bug_when>
    <thetext>I have this exact situation. The sample web page (HTML5) is
http://www.lovatasinhala.com/iso-8859-1-l.htm
I went ahead and included windows-1252 as the character encoding in the HTTP header (via the .htaccess file). However, the point is that this page does not contain any characters outside iso-8859-1, and yet the validator says it is windows-1252 when it only uses iso-8859-1.
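
To make the point concrete, here is a minimal Python sketch (not anything the validator itself runs; the helper name ambiguous_offsets is made up for illustration). It assumes the two encodings can only decode differently for bytes in the range 0x80 to 0x9F, so a page with no such bytes reads the same under either label:

```python
# Bytes 0x80-0x9F are the only ones where iso-8859-1 and windows-1252 differ:
# iso-8859-1 maps them to C1 control codes, windows-1252 mostly to printable
# characters such as the euro sign.
C1_RANGE = range(0x80, 0xA0)


def ambiguous_offsets(data: bytes) -> list:
    """Return the offsets of bytes that the two encodings decode differently."""
    return [i for i, b in enumerate(data) if b in C1_RANGE]


# A page that stays outside 0x80-0x9F decodes identically either way.
plain = b"caf\xe9"
assert ambiguous_offsets(plain) == []
assert plain.decode("iso-8859-1") == plain.decode("windows-1252")

# A page containing byte 0x80 does not: it is U+20AC only under windows-1252.
euro = b"\x80 5"
print(ambiguous_offsets(euro))  # [0]
assert euro.decode("windows-1252") != euro.decode("iso-8859-1")
```

Running ambiguous_offsets over the raw bytes of the page and getting an empty list would confirm that the choice between the two labels cannot change the decoded text.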

As far as I can see, the difference between iso-8859-1 and windows-1252 is that windows-1252 has the following characters in addition to those in iso-8859-1.
The first column is the single-byte code (hexadecimal), the second is the character, the third is the corresponding Unicode code point, and the fourth is the Unicode character name.
80 	€ 	20AC 	EURO SIGN
82 	‚ 	201A 	SINGLE LOW-9 QUOTATION MARK
83 	ƒ 	0192 	LATIN SMALL LETTER F WITH HOOK
84 	„ 	201E 	DOUBLE LOW-9 QUOTATION MARK
85 	… 	2026 	HORIZONTAL ELLIPSIS
86 	† 	2020 	DAGGER
87 	‡ 	2021 	DOUBLE DAGGER
88 	ˆ 	02C6 	MODIFIER LETTER CIRCUMFLEX ACCENT
89 	‰ 	2030 	PER MILLE SIGN
8A 	Š 	0160 	LATIN CAPITAL LETTER S WITH CARON
8B 	‹ 	2039 	SINGLE LEFT-POINTING ANGLE QUOTATION MARK
8C 	Œ 	0152 	LATIN CAPITAL LIGATURE OE
8E 	Ž 	017D 	LATIN CAPITAL LETTER Z WITH CARON
91 	‘ 	2018 	LEFT SINGLE QUOTATION MARK
92 	’ 	2019 	RIGHT SINGLE QUOTATION MARK
93 	“ 	201C 	LEFT DOUBLE QUOTATION MARK
94 	” 	201D 	RIGHT DOUBLE QUOTATION MARK
95 	• 	2022 	BULLET
96 	– 	2013 	EN DASH
97 	— 	2014 	EM DASH
98 	˜ 	02DC 	SMALL TILDE
99 	™ 	2122 	TRADE MARK SIGN
9A 	š 	0161 	LATIN SMALL LETTER S WITH CARON
9B 	› 	203A 	SINGLE RIGHT-POINTING ANGLE QUOTATION MARK
9C 	œ 	0153 	LATIN SMALL LIGATURE OE
9E 	ž 	017E 	LATIN SMALL LETTER Z WITH CARON
9F 	Ÿ 	0178 	LATIN CAPITAL LETTER Y WITH DIAERESIS</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>86436</commentid>
    <comment_count>2</comment_count>
    <who name="Michael[tm] Smith">mike</who>
    <bug_when>2013-04-21 01:34:48 +0000</bug_when>
    <thetext>I believe that the validator behavior here conforms to the requirements in the HTML5 spec. So if you want to discuss those requirements and/or suggest that they be changed, the best places to have that discussion are public-html@w3.org and whatwg@whatwg.org</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>86453</commentid>
    <comment_count>3</comment_count>
    <who name="matteo">matteosistisette</who>
    <bug_when>2013-04-21 13:10:29 +0000</bug_when>
    <thetext>@Michael[tm] Smith I don&apos;t think you read the report carefully.

The page is declared as ISO-8859-1 both in the headers and in the meta tag. Why on earth should it be detected as another encoding?!?!?!?!?

Where does the html5 standard say such an absurdity?</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>86454</commentid>
    <comment_count>4</comment_count>
    <who name="matteo">matteosistisette</who>
    <bug_when>2013-04-21 13:11:07 +0000</bug_when>
    <thetext>Also note the part about the unclearness of the message.</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>86908</commentid>
    <comment_count>5</comment_count>
    <who name="Ian &apos;Hixie&apos; Hickson">ian</who>
    <bug_when>2013-04-26 23:21:33 +0000</bug_when>
    <thetext>matteo, see http://encoding.spec.whatwg.org/ for context</thetext>
  </long_desc>
      
      

    </bug>

</bugzilla>