Bugzilla – Bug 16768
Update HTML to make use of the Encoding Standard
Last modified: 2013-02-01 00:48:26 UTC
The IANA registry is unbounded, does not match implementations when it comes to encodings and their labels, does not detail extensions to encodings that need to be supported, and does not detail error handling for encodings; it is inadequate by today's standards. http://dvcs.w3.org/hg/encoding/raw-file/tip/Overview.html was written to solve this problem, and by using it in HTML we can simplify the following:
* Instead of "preferred MIME name" we can now talk about "name" of the "encoding".
* "ASCII-compatible character encoding" is no longer needed as only utf-16 and utf-16be are incompatible per the restricted list.
* The "decode a byte string as UTF-8, with error handling" algorithm can be removed in favor of using "utf-8 decode" which has the correct error handling (should be identical).
* For encoding (URLs and <form>) a custom "encoder error" needs to be defined, by returning from the encoder algorithm and feeding it the intended replacement characters. (You do not know in advance which code points cannot be encoded.)
* In the suggested default encoding list the encoding names can be updated to use the canonical name rather than a label.
* "Misinterpreted for compatibility" is no longer needed and the encoding overrides table can also be removed.
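A rough Python sketch of the encoding bullet above, to show why the error has to be handled while encoding rather than up front. It is not the Encoding Standard's or HTML's algorithm: the helper name `encode_with_fallback` is hypothetical, it assumes a stateless encoding such as windows-1252 (Python's `cp1252`), and the decimal `&#N;` replacement is chosen for illustration only.

```python
def encode_with_fallback(text, encoding):
    """Encode text one code point at a time; on an encoder error,
    emit a decimal numeric character reference instead.

    Assumes a stateless encoding, so per-character encoding is safe.
    """
    out = bytearray()
    for ch in text:
        try:
            out += ch.encode(encoding)
        except UnicodeEncodeError:
            # Only discovered at encode time: this code point has no
            # representation in the target encoding.
            out += "&#{};".format(ord(ch)).encode("ascii")
    return bytes(out)


# U+20AC is encodable in windows-1252 (byte 0x80); U+65E5 is not.
print(encode_with_fallback("a\u20ac\u65e5", "cp1252"))

# Decoding side: malformed input becomes U+FFFD instead of raising.
print(b"a\xffb".decode("utf-8", errors="replace"))
```

Python's built-in `errors="xmlcharrefreplace"` handler gives the same kind of replacement in one call; the explicit loop is only there to make the error point visible.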
Thanks for the draft.
Where are the labels coming from? I'm asking because if the aim of the spec is to handle legacy content, then additional labels should be added. For example, windows-1250 was sometimes referred to as cp1250, and you will find plenty of such pages in the wild.
The current draft is indeed rather conservative when it comes to single-byte labels (IE is the only browser that does not recognize that label as far as I can tell). I filed bug 16773 to change that.
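To make the label discussion concrete, here is a small Python sketch of mapping labels to an encoding name. The table below is illustrative only and is not copied from the draft (whether cp1250 is listed is exactly what bug 16773 is about); `LABELS` and `get_encoding_name` are hypothetical names, and the matching rule (strip ASCII whitespace, compare ASCII case-insensitively) is my reading of how the lookup is meant to work.

```python
import string

# Illustrative label table; NOT the draft's actual list.
LABELS = {
    "windows-1250": "windows-1250",
    "cp1250": "windows-1250",
    "x-cp1250": "windows-1250",
    "utf-8": "utf-8",
    "utf8": "utf-8",
}

ASCII_WHITESPACE = "\t\n\f\r "
# Lowercase A-Z only, so non-ASCII look-alikes are never folded.
ASCII_LOWER = str.maketrans(string.ascii_uppercase, string.ascii_lowercase)


def get_encoding_name(label):
    """Return the encoding name for a label, or None if unknown."""
    key = label.strip(ASCII_WHITESPACE).translate(ASCII_LOWER)
    return LABELS.get(key)


print(get_encoding_name(" CP1250 "))
print(get_encoding_name("no-such-label"))
```

With a closed table like this, supporting a legacy label found in the wild is a one-line addition rather than a registry question.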
*** Bug 17151 has been marked as a duplicate of this bug. ***
This bug was cloned to create bug 17839 as part of operation convergence.
EDITOR'S RESPONSE: This is an Editor's Response to your comment. If
you are satisfied with this response, please change the state of
this bug to CLOSED. If you have additional information and would
like the Editor to reconsider, please reopen this bug. If you would
like to escalate the issue to the full HTML Working Group, please
add the TrackerRequest keyword to this bug, and suggest title and
text for the Tracker Issue; or you may create a Tracker Issue
yourself, if you are able to do so. For more details, see this
Change Description: https://github.com/w3c/html/commit/3d85bac87355240a433865ec56074a80c33a271d
Rationale: IANACHARSET does not match implementation - the WHATWG encoding specification is a much better reference
Is there a version of http://encoding.spec.whatwg.org/ published at the W3C?
Remaining questions from Ian in the related bug https://www.w3.org/Bugs/Public/show_bug.cgi?id=17839#c1 :
> Do you flag people using bytes that aren't compatible between ISO-8859-1 and
> Win1252 as a conformance error anywhere, or are we just saying ISO-8859-1 is
> bogus and these are the new tables, end of story?
> I've left references to "ASCII-compatible character encoding" for now; is it
> not still plausible that people are using EBCDIC mainframes and implementing
> HTML parsers for them?
> The "utf-8 decode" and "decode" algorithms are too clever for HTML's use, so I
> just directly use the relevant decoder algorithms. "encode" doesn't seem to add
> anything useful vs "encoder", either.
>> (You do not know in advance which code points cannot be encoded.)
> Can you elaborate on this?
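For the first quoted question (ISO-8859-1 vs Win1252), a short Python sketch of which bytes are affected. It assumes Python's `latin-1` and `cp1252` codecs as stand-ins for the two tables; the helper name `c1_range_bytes` is hypothetical, and nothing here says whether such bytes should be a conformance error.

```python
# Bytes 0x80-0x9F are C1 controls in ISO-8859-1 but mostly printable
# characters in windows-1252; every other byte decodes the same.
C1_RANGE = range(0x80, 0xA0)


def c1_range_bytes(data):
    """Return the byte values in data that the two tables disagree on."""
    return [b for b in data if b in C1_RANGE]


sample = b"\x80 10 \x93ok\x94"
print(c1_range_bytes(sample))
print(repr(sample.decode("latin-1")))
print(repr(sample.decode("cp1252")))
```

A checker that wanted to flag such content would only need to look for bytes in that range in a document labelled ISO-8859-1.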