13771 – Encodings 'misinterpreted for compatibility' should risk fatal error in XHTML

This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019. Please see the home page for more details.

Status:	RESOLVED WONTFIX

Alias:	None

Product:	HTML WG
Classification:	Unclassified
Component:	HTML5 spec
Version:	unspecified
Hardware:	All All

Importance:	P3 major
Target Milestone:	---
Assignee:	contributor
QA Contact:	HTML WG Bugzilla archive list

URL:	http://www.w3.org/TR/html5/parsing#ta...
Whiteboard:
Keywords:

Depends on:
Blocks:

Reported:	2011-08-13 17:58 UTC by Leif Halvard Silli
Modified:	2011-08-14 19:13 UTC
CC List:	6 users

Description Leif Halvard Silli 2011-08-13 17:58:58 UTC

REQUEST:

State that HTML5's table of encoding overrides is not to be adhered to by XHTML parsers, and that, unless they deviate from the encoding overrides table, they effectively do not support those encodings and are thus required to emit a fatal error whenever they encounter such labels.

BACKGROUND:

HTML5 keeps a table of encoding labels that should be 'misinterpreted for compatibility' - see <http://www.w3.org/TR/html5/parsing#table-encoding-overrides>. For instance, 'US-ASCII' as well as 'ISO-8859-1' should be treated as 'windows-1252'.
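
As a rough illustration of what such an override means in practice, the sketch below models the two rows named here as a Python dictionary and decodes one byte under the overridden encoding. The names `ENCODING_OVERRIDES` and `effective_encoding` are hypothetical, and only these two labels are included; the full table is in the HTML5 draft.

```python
# Hypothetical excerpt of the override table: declared label -> encoding actually used.
ENCODING_OVERRIDES = {
    "us-ascii": "windows-1252",
    "iso-8859-1": "windows-1252",
}

def effective_encoding(label: str) -> str:
    """Return the encoding a parser that follows the overrides would use."""
    normalized = label.strip().lower()
    return ENCODING_OVERRIDES.get(normalized, normalized)

# 0x80 is EURO SIGN in windows-1252 but is not a US-ASCII byte at all.
# With the override applied, it decodes without complaint.
decoded = b"\x80".decode(effective_encoding("US-ASCII"))
```

Under this model the declared label never reaches the decoder, which is why a parser behaving this way cannot be said to support US-ASCII as such.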

By contrast, XML 1.0 has the following rule:

   ]] It is a fatal error when an XML processor encounters 
      an entity with an encoding that it is unable to process [[ 
      <http://www.w3.org/TR/xml/#charencoding>

Thus: An XML parser is required to know, before it parses the page, which encodings it supports. Hence, if an XHTML parser in a Web browser knows that it does not support US-ASCII or ISO-8859-1 (because it always misinterprets each of them as WINDOWS-1252), then that parser supports neither US-ASCII nor ISO-8859-1.
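
The reasoning above can be sketched as follows. This is a minimal illustration, not how any browser's XML parser is implemented: `FatalXMLError`, `SUPPORTED_XML_ENCODINGS` and `check_declared_encoding` are hypothetical names, and the supported set assumes a parser that handles only UTF-8 and UTF-16.

```python
class FatalXMLError(Exception):
    """Hypothetical stand-in for an XML 1.0 fatal error."""

# Assumption: this parser supports only UTF-8 and UTF-16.
SUPPORTED_XML_ENCODINGS = {"utf-8", "utf-16"}

def check_declared_encoding(label: str) -> str:
    """Return the normalized label, or raise if the parser cannot process it."""
    normalized = label.strip().lower()
    if normalized not in SUPPORTED_XML_ENCODINGS:
        raise FatalXMLError(f"unsupported encoding: {label!r}")
    return normalized
```

A parser built this way would stop at the encoding declaration of a US-ASCII labelled page instead of silently substituting another encoding.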

WEB BROWSERS REALITY:

The fact is that Firefox, Webkit and Opera (I don't know about IE9) currently fail to emit a fatal error when an XHTML page is labelled as US-ASCII or ISO-8859-1 but contains directly typed characters that are legal in WINDOWS-1252. Thus, their XML parsers currently do not support the US-ASCII or ISO-8859-1 encodings. (And according to XML 1.0 they are also not required to support them!) This violation of XML 1.0 must thus lead to a fatal error.

TEST PAGES:

Test page, US-ASCII labelled (originally UTF-8 encoded) page with illegal characters:
     http://malform.no/testing/html5/bom/normal-XML-ascii-encoding
Test page, ISO-8859-1 labelled with Windows-1252 characters: 
     http://malform.no/testing/html5/bom/normal-XML-iso88591


JUSTIFICATION:

Since Web browsers fail to adhere to this XML 1.0 rule, and because HTML5 claims to cover both XHTML and HTML, HTML5 should specify that the encoding override rules in fact only apply to HTML parsers.


BENEFITS:

If the XHTML parsers inside Web browsers start to emit the required fatal errors, then it will further strengthen the trend towards UTF-8.

Comment 1 Anne 2011-08-13 18:05:27 UTC

I think this is actually a feature of those browsers. Having less encodings is a benefit. Not having to support US-ASCII on its own but always treating it as Windows-1252 is how the specifications should turn out to be in the end.

Comment 2 Leif Halvard Silli 2011-08-13 18:24:49 UTC

Internet Explorer 9:

IE9 could be said to behave slightly better than the others: If you contrast
the ISO-8859-1 test page
(<http://malform.no/testing/html5/bom/normal-XML-iso88591>) with a UTF-8
encoded reference page
(<http://malform.no/testing/html5/bom/normal-XML-iso88591utfeight>), then it
turns out that IE9 for the ISO-8859-1 encoded page silently drops the WIN-1252
specific characters.

Of course, IE9 should have emitted a fatal error. But at least its deviation
from both HTML5 and XML 1.0 adds yet another reason to support XML 1.0's
requirement to display a fatal error.

Btw, when it comes to the US-ASCII test page
(<http://malform.no/testing/html5/bom/normal-XML-ascii-encoding>) then IE9 (in
Adobe's browserlab and in www.netrenderer.de) fails to display it, which I
guess counts as a fatal error.

Comment 3 Leif Halvard Silli 2011-08-13 18:51:19 UTC

(In reply to comment #2)

But it should be said that IE6 to IE9 fail to treat US-ASCII as WINDOWS-1252 even for HTML files:

http://malform.no/testing/html5/bom/normal-HTML-ascii-encoding

Comment 4 Leif Halvard Silli 2011-08-13 18:52:23 UTC

(In reply to comment #1)
> I think this is actually a feature of those browsers.

Could be interesting to know why you think so given what you state in the following sentences.


> Having less encodings is a benefit.

Indeed. But that sounds like an argument in favour of emitting fatal error - as required by XML 1.0 - whenever the browser stumbles upon an encoding label for an encoding that it does not support.


> Not having to support US-ASCII on its own but always treating it as
> Windows-1252 is how the specifications should turn out to be in the end.

That *could* be interpreted as an argument for supporting neither US-ASCII nor WINDOWS-1252 in XML parsers.

But if you seriously think that US-ASCII should become a legal alias for WINDOWS-1252 in XML, then I guess relevant bugs against XML 1.0 and - I suppose - the IANA registry will soon be filed - unless that has happened already?

Comment 5 Anne 2011-08-14 08:37:27 UTC

The long term goal is to replace the registry: http://wiki.whatwg.org/wiki/Web_Encodings

Comment 6 Leif Halvard Silli 2011-08-14 12:06:02 UTC

(In reply to comment #5)
That page does not mention XML.

For HTML, the concept of 'fatal error' doesn't exist. As such, when the charset doesn't match, one must follow a different strategy from that of XML. And Win-1252 makes some sense, because this will also for the most part avoid question marks for unmappable characters - et cetera.

However, there is no guarantee that a "US-ASCII" labelled page actually is Win-1252 encoded. For example, it could just as well be UTF-8 encoded. It could even be US-ASCII encoded ...
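
A small sketch of that point, using Python's standard codecs (the sample strings are made up for illustration): UTF-8 bytes read as Windows-1252 decode without any error but give the wrong text, while genuinely ASCII content reads the same under all three encodings.

```python
text = "é"
utf8_bytes = text.encode("utf-8")  # two bytes: 0xC3 0xA9

# Misreading the UTF-8 bytes as windows-1252 succeeds but yields two wrong characters.
misread = utf8_bytes.decode("windows-1252")

# Pure ASCII content is the one case where all three readings agree.
ascii_bytes = b"plain text"
same = (
    ascii_bytes.decode("ascii")
    == ascii_bytes.decode("utf-8")
    == ascii_bytes.decode("windows-1252")
)
```

Here `misread` is the two-character string "Ã©" rather than "é", and `same` is true.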

Btw,  I forgot to quote this, as part of the justification for this bug:

]]
 It is a fatal error if an XML entity is determined (via default, encoding declaration, or higher-level protocol) to be in a certain encoding but contains byte sequences that are not legal in that encoding.
[[
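
To make the quoted rule concrete, here is a minimal sketch that treats a strict decoding failure as a fatal error. `FatalXMLError` and `decode_entity` are hypothetical names, and Python's strict codec error handling is used here only as a stand-in for a parser's own byte-legality check.

```python
class FatalXMLError(Exception):
    """Hypothetical stand-in for an XML 1.0 fatal error."""

def decode_entity(data: bytes, declared_encoding: str) -> str:
    """Decode an entity strictly; bytes illegal in the declared encoding are fatal."""
    try:
        return data.decode(declared_encoding, errors="strict")
    except UnicodeDecodeError as exc:
        raise FatalXMLError(
            f"byte sequence not legal in {declared_encoding}"
        ) from exc

# UTF-8 bytes for EURO SIGN inside an entity declared as US-ASCII.
euro_utf8 = b"\xe2\x82\xac"
try:
    decode_entity(euro_utf8, "ascii")
    fatal = False
except FatalXMLError:
    fatal = True
```

In this sketch `fatal` ends up true, which corresponds to the behaviour requested for the US-ASCII labelled test page.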

Comment 7 Anne 2011-08-14 12:08:07 UTC

Encodings and their labels apply to HTML, XML, CSS, etc.

Comment 8 Anne 2011-08-14 12:09:22 UTC

To make it even more clear, us-ascii will become a label for the windows-1252 encoding.

Comment 9 Leif Halvard Silli 2011-08-14 13:57:01 UTC

(In reply to comment #7)
> Encodings and their labels apply to HTML, XML, CSS, etc.

That page does not talk about 'HTML, XML, CSS etc' - it talks about "Web" and "HTML".


(In reply to comment #8)
> To make it even more clear, us-ascii will become a label for the windows-1252
> encoding.

Such a thing is probably not very useful for XML. Moreover, it seems smartest to forbid the "US-ASCII" label, if it isn't supposed to have the intended meaning anymore.


BTW, Validator.nu behaves as requested by this bug: http://validator.nu/?doc=http%3A%2F%2Fmalform.no%2Ftesting%2Fhtml5%2Fbom%2Fascii

Comment 10 Anne 2011-08-14 14:00:57 UTC

Feel free to fix the page if you think it is not clear. And it is useful that all technologies behave the same when it comes to encoding labels. Having differences for legacy cruft is not needed. Better to be consistent.

Comment 11 Leif Halvard Silli 2011-08-14 18:02:31 UTC

(In reply to comment #10)
> Feel free to fix the page if you think it is not clear. And it is useful that
> all technologies behave the same when it comes to encoding labels. Having
> differences for legacy cruft is not needed. Better to be consistent.

I think implementing this bug can work OK w.r.t. such a long-term goal.

XML parsers are not required to support more than UTF-8 and UTF-16. If removing support for the "US-ASCII" label means removing support for Windows-1252 (and its misinterpreted aliases), then this would mean that the XML parsers of the Web browsers would not need to fall for the temptation to interpret UTF-8 as WIN-1252 (like e.g. Opera and Webkit under some conditions currently do).

It is not like the future of the Web is Windows-1252 - the future is UTF-8. And XML on the web is rather rare - and non-UTF-8 XML is probably even much rarer.

Whenever the long-term goal you describe is reached, the Web browsers' XML parsers can consider enabling support for Win-1252 again, should they want to do that.

Comment 12 Ian 'Hixie' Hickson 2011-08-14 19:12:53 UTC

EDITOR'S RESPONSE: This is an Editor's Response to your comment. If you are satisfied with this response, please change the state of this bug to CLOSED. If you have additional information and would like the editor to reconsider, please reopen this bug. If you would like to escalate the issue to the full HTML Working Group, please add the TrackerRequest keyword to this bug, and suggest title and text for the tracker issue; or you may create a tracker issue yourself, if you are able to do so. For more details, see this document:
   http://dev.w3.org/html5/decision-policy/decision-policy.html

Status: Rejected
Change Description: no spec change
Rationale: The whole section to which this bug refers is specifically in the HTML section of the spec and has nothing to do with XML. How to parse XML is out of scope for the HTML spec. I recommend raising this with the XML working group if something needs to change.