
Bug 22223 - Latin-1 characters (æ, þ etc.) are rejected as errors by validator
Summary: Latin-1 characters (æ, þ etc.) are rejected as errors by validator
Status: RESOLVED WORKSFORME
Alias: None
Product: HTML Checker
Classification: Unclassified
Component: General
Version: unspecified
Hardware: PC All
Importance: P2 major
Target Milestone: ---
Assignee: Michael[tm] Smith
QA Contact: qa-dev tracking
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2013-05-31 05:17 UTC by jc ahágama
Modified: 2013-06-25 20:11 UTC
CC List: 2 users

See Also:


Description jc ahágama 2013-05-31 05:17:39 UTC
An HTML page containing characters from the ISO-8859-1 character set used to validate as correct when written and tested as HTML 4.01. When such a page written for HTML5 was tested, windows-1252 was advised over ISO-8859-1. If there was no charset declaration, windows-1252 was assumed and the page passed.

Until recently, UTF-8 was encouraged as the charset declaration, but windows-1252 was still accepted. Now the rule is enforced by issuing errors/warnings like these:
1. Using windows-1252 instead of the declared encoding iso-8859-1.
2. Legacy encoding windows-1252 used. Documents should use UTF-8.
3. utf8 "\xE6" does not map to Unicode.

What does 3. above mean? This is a catch-22. If you declare UTF-8, it is an error, as if æ, þ etc. were outside Unicode. I thought we were talking about the UTF-8 encoding of characters. How does Unicode factor in here?

RFC 3629 is very clear about how to encode ASCII and Latin-1 (single-byte character set) characters into UTF-8. It appears that ASCII is accepted while the Latin-1 Supplement is rejected for some unpublished reason.
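
For reference, the mapping RFC 3629 describes can be checked from a Python 3 prompt (a quick illustration; æ is U+00E6):

    >>> "æ".encode("iso-8859-1")   # a single byte, 0xE6, in Latin-1
    b'\xe6'
    >>> "æ".encode("utf-8")        # two bytes, 0xC3 0xA6, as RFC 3629 specifies
    b'\xc3\xa6'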

Please check these pages to understand the problem.
http://ahangama.com/charset-iso-8859-1.htm
http://ahangama.com/charset-none.htm
http://ahangama.com/charset-utf-8.htm
http://ahangama.com/charset-windows-1252.htm

Thank you.
Comment 1 Michael[tm] Smith 2013-06-19 07:31:10 UTC
The validator behavior was changed to match the current HTML spec, which now references the Encoding standard (http://encoding.spec.whatwg.org/) and says: "Authors must use the utf-8 encoding and must use the 'utf-8' label to identify it."
Comment 2 jc ahágama 2013-06-20 03:33:39 UTC
Mr. Smith,
Thank you for the reply. I read the page you gave.

Someone closed the issue without actually testing it.

I think what they mean by "Authors must use the utf-8 encoding" is that authors must declare UTF-8 as 'charset'. Am I right? (I have been writing web pages since the '90s and I believe I can understand the technical background here.)

I *want* to follow standards. The problem is when I declare UTF-8, meaning use it for encoding, the browser shows the place-holders for the codepoints and the Validator says, "Error found while checking this document as HTML5!"

Please plug in the following page to the Validator (at validator.w3.org):
http://ahangama.com/charset-utf-8.htm

The error is explained thus:
"Sorry, I am unable to validate this document because on line 20 it contained one or more bytes that I cannot interpret as utf-8 (in other words, the bytes found are not valid values in the specified Character Encoding). Please check both the content of the file and the character encoding indication.

The error was: utf8 "\xE6" does not map to Unicode"

U+00E6 is the Old English letter ash (æ). It is found in the following Unicode block:
http://www.unicode.org/charts/PDF/U0080.pdf

Clearly, Validator is wrong and has to be fixed.
Comment 3 Michael[tm] Smith 2013-06-20 04:09:58 UTC
(In reply to comment #2)
> I think what they mean by "Authors must use the utf-8 encoding" is that
> authors must declare UTF-8 as 'charset'. Am I right?

No. It means exactly what it says: The contents must be encoded in utf-8.

> I *want* to follow standards. The problem is when I declare UTF-8, meaning
> use it for encoding,

Declaring it by putting a meta@charset element in a file does not magically set the actual encoding to utf-8. You have to actually encode the contents in utf-8.

> the browser shows the place-holders for the codepoints
> and the Validator says, "Error found while checking this document as HTML5!"
> 
> Please plug in the following page to the Validator (at validator.w3.org):
> http://ahangama.com/charset-utf-8.htm

That file is not encoded in utf-8. It's encoded in iso-8859-1, which is something very different from utf-8. The <meta http-equiv="content-type" content="text/html; charset=utf-8" /> element you have in there does not change the encoding of the file; all it does is make a browser try to process it as utf-8 despite the fact that it's actually encoded in iso-8859-1. So the browser ends up displaying replacement characters for some of the code points instead of showing the correct glyphs.
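
The failure is easy to reproduce outside the validator; for example, a small Python 3 sketch (0xE6 is the iso-8859-1 byte for æ):

    # the single byte that iso-8859-1 uses for æ
    latin1_bytes = "æ".encode("iso-8859-1")          # b'\xe6'

    # a utf-8 parser rejects it: 0xE6 only *starts* a multi-byte sequence
    # in utf-8 and is never a complete character on its own
    try:
        latin1_bytes.decode("utf-8")
    except UnicodeDecodeError as err:
        print(err)                                   # "... can't decode byte 0xe6 ..."

    # a browser forced to treat the page as utf-8 falls back to U+FFFD,
    # the replacement character you see as placeholder glyphs
    print(latin1_bytes.decode("utf-8", errors="replace"))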

If you manually switch the encoding setting for that page in your browser to iso-8859-1, the characters will display as expected.

> Clearly, Validator is wrong and has to be fixed.

There's nothing wrong with the validator. The problem is that you don't actually have that file encoded in utf-8. You need to figure out how to actually encode it in utf-8 in whatever editor you're using, and then try again.
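
If the editor cannot save as utf-8, the conversion can also be scripted; for example, a minimal Python 3 sketch (the file name is just a placeholder for whatever page needs converting):

    # read the existing bytes as iso-8859-1, then write them back out as utf-8
    # ("page.htm" is a placeholder file name)
    with open("page.htm", encoding="iso-8859-1") as src:
        text = src.read()
    with open("page.htm", "w", encoding="utf-8") as dst:
        dst.write(text)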
Comment 4 Michael[tm] Smith 2013-06-20 04:17:35 UTC
By the way, for checking HTML5 files, I recommend you use http://validator.w3.org/nu/ instead of http://validator.w3.org/
Comment 5 jc ahágama 2013-06-25 20:11:20 UTC
I understand now. Thank you for explaining so clearly. My fault.

My problem has been that I still use HTML-Kit, which offers no choice of encoding. I suspect the new ruling will be upgraded from SHOULD to MUST, forcing files like the ones I write (and all Western European pages) to be larger, unnecessarily taking up precious bandwidth. I think Basic Latin and the Latin-1 Supplement could both stay as safe single-byte ranges if the first 32 characters of the latter were prohibited, or if windows-1252 were allowed instead.
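
The size cost is easy to estimate with a quick Python 3 check on a made-up sample (only the Latin-1 Supplement letters grow from one byte to two; plain ASCII stays at one byte either way):

    # made-up sample text with three Latin-1 Supplement letters (Æ, þ, ð)
    sample = "Ævar þakkaði"
    print(len(sample.encode("windows-1252")))   # 12 bytes, one per character
    print(len(sample.encode("utf-8")))          # 15 bytes, the three letters double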

Anyway, I put the file shown above into Notepad and saved it as utf-8 to get a passing page. Thank you. And thanks for telling me to use the Nu page, which is friendlier:
http://hathvenibalavegaya.com/index.htm <== utf-8 (20,038 bytes) 
http://hathvenibalavegaya.com/indexOld.htm <== windows-1252 (18,633 bytes)

I beg the great technocrats to allow windows-1252 for the sake of the poor, like those living in Sri Lanka, and for the public network, which carries a lot of Western European documents. <-- My proposed patch; my penny's worth.