16768 – Update HTML to make use of the Encoding Standard

This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019.

Status:	RESOLVED FIXED

Alias:	None

Product:	HTML WG
Classification:	Unclassified
Component:	HTML5 spec
Version:	unspecified
Hardware:	PC All

Importance:	P2 normal
Target Milestone:	---
Assignee:	Silvia Pfeiffer
QA Contact:	HTML WG Bugzilla archive list

Duplicates (1):	17151

Reported:	2012-04-18 07:54 UTC by Anne
Modified:	2013-02-01 00:48 UTC
CC List:	9 users

Description Anne 2012-04-18 07:54:33 UTC

The IANA registry is unbounded, does not match implementations when it comes to encodings and their labels, does not detail the extensions to encodings that need to be supported, and does not detail error handling for encodings; it is inadequate by today's standards. http://dvcs.w3.org/hg/encoding/raw-file/tip/Overview.html was written to solve this problem, and by using it in HTML we can simplify the following:

* Instead of "preferred MIME name" we can now talk about "name" of the "encoding".
* "ASCII-compatible character encoding" is no longer needed as only utf-16 and utf-16be are incompatible per the restricted list.
* The "decode a byte string as UTF-8, with error handling" algorithm can be removed in favor of using "utf-8 decode" which has the correct error handling (should be identical).
* For encoding (URLs and <form>) a custom "encoder error" needs to be defined, by returning from the encoder algorithm and feeding it the intended replacement characters. (You do not know in advance which code points cannot be encoded.)
* In the suggested default encoding list the encoding names can be updated to use the canonical name rather than a label.
* "Misinterpreted for compatibility" is no longer needed and the encoding overrides table can also be removed.
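
Two of the points above (UTF-8 decoding with built-in error handling, and the "encoder error" fallback) can be illustrated with a minimal Python sketch. It uses Python's built-in codecs as a stand-in; these are not the Encoding Standard's algorithms, and their tables may differ in places. The function names `utf8_decode` and `encode_for_form` are hypothetical, and using a decimal numeric character reference as the replacement is an assumption made for illustration.

```python
def utf8_decode(data: bytes) -> str:
    """Decode UTF-8, turning malformed sequences into U+FFFD instead of failing."""
    return data.decode("utf-8", errors="replace")


def encode_for_form(text: str, encoding: str) -> bytes:
    """Encode text; code points the encoding cannot represent become &#N;.

    Assumption: a decimal numeric character reference is the replacement.
    The caller does not pre-scan the input. The codec reports each
    unencodable code point as it reaches it, and the error handler
    supplies the replacement at that point.
    """
    return text.encode(encoding, errors="xmlcharrefreplace")


if __name__ == "__main__":
    # ascii() keeps the output printable on any terminal encoding.
    print(ascii(utf8_decode(b"caf\xc3\xa9 \xff")))
    print(encode_for_form("a\u20ac\u2603", "cp1252"))
```

The second function shows why the set of unencodable code points need not be known in advance: the decision is made per code point while encoding, not by inspecting the string beforehand.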

Comment 1 Jirka Kosek 2012-04-18 11:05:14 UTC

Hi Anne,

thanks for the draft.

Where are the labels coming from? I'm asking because if the aim of the spec is to handle legacy content, then additional labels should be added. For example, windows-1250 was sometimes referred to as cp1250, and you will find plenty of such pages in the wild.

Jirka

Comment 2 Anne 2012-04-18 11:31:55 UTC

The current draft is indeed rather conservative when it comes to single-byte labels (IE is the only browser that does not recognize that label as far as I can tell). I filed bug 16773 to change that.
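
The label question in comments 1 and 2 amounts to a lookup from a label to the name of an encoding. A minimal sketch of that idea follows, assuming a deliberately tiny, hypothetical table; the actual label list is defined by the Encoding Standard and is not reproduced here. The names `LABELS` and `get_encoding` are illustrative only.

```python
from typing import Optional

# Hypothetical excerpt for illustration only; not the Encoding Standard's table.
LABELS = {
    "windows-1250": "windows-1250",
    "cp1250": "windows-1250",  # legacy alias mentioned in comment 1
    "utf-8": "utf-8",
    "utf8": "utf-8",
}


def get_encoding(label: str) -> Optional[str]:
    """Return the encoding name for a label, or None if the label is unknown.

    Surrounding ASCII whitespace is stripped and the label is lowercased
    before the lookup, which is an approximation of case-insensitive matching.
    """
    return LABELS.get(label.strip(" \t\n\f\r").lower())


if __name__ == "__main__":
    print(get_encoding("CP1250"))
    print(get_encoding("no-such-label"))
```

Because the table is closed, an unrecognized label yields no encoding at all, which is the contrast with the unbounded IANA registry described in the report.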

Comment 3 Anne 2012-05-23 07:52:49 UTC

*** Bug 17151 has been marked as a duplicate of this bug. ***

Comment 4 contributor 2012-07-18 07:00:25 UTC

This bug was cloned to create bug 17839 as part of operation convergence.

Comment 5 Silvia Pfeiffer 2013-02-01 00:48:26 UTC

EDITOR'S RESPONSE: This is an Editor's Response to your comment. If
you are satisfied with this response, please change the state of
this bug to CLOSED. If you have additional information and would
like the Editor to reconsider, please reopen this bug. If you would
like to escalate the issue to the full HTML Working Group, please
add the TrackerRequest keyword to this bug, and suggest title and
text for the Tracker Issue; or you may create a Tracker Issue
yourself, if you are able to do so. For more details, see this
document: http://dev.w3.org/html5/decision-policy/decision-policy-v2.html

Status: Accepted

Change Description: https://github.com/w3c/html/commit/3d85bac87355240a433865ec56074a80c33a271d

Rationale: IANACHARSET does not match implementations; the WHATWG Encoding specification is a much better reference.

Open Questions:

Is there a version of http://encoding.spec.whatwg.org/ published at the W3C?

Ian's remaining questions from related bug https://www.w3.org/Bugs/Public/show_bug.cgi?id=17839#c1:

> Do you flag people using bytes that aren't compatible between ISO-8859-1 and 
> Win1252 as a conformance error anywhere, or are we just saying ISO-8859-1 is 
> bogus and these are the new tables, end of story?
>
> I've left references to "ASCII-compatible character encoding" for now; is it
> not still plausible that people are using EBCDIC mainframes and implementing
> HTML parsers for them?
>
> The "utf-8 decode" and "decode" algorithms are too clever for HTML's use, so I
> just directly use the relevant decoder algorithms. "encode" doesn't seem to add
> anything useful vs "encoder", either.
>
>> (You do not know in advance which code points cannot be encoded.)
>
> Can you elaborate on this?
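
Ian's first question concerns bytes that mean different things in ISO-8859-1 and windows-1252. The following is a small sketch of how such bytes could be detected, using Python's `latin-1` and `cp1252` codecs as stand-ins. That substitution is an assumption: Python's tables are not guaranteed to match the Encoding Standard's, and Python's `cp1252` leaves a few bytes undefined. The function name `differing_bytes` is hypothetical.

```python
from typing import List


def differing_bytes(data: bytes) -> List[int]:
    """Return the byte values in data that decode differently under the two codecs."""
    found = []
    for value in sorted(set(data)):
        single = bytes([value])
        as_latin1 = single.decode("latin-1")
        # Python's cp1252 has undefined bytes; "replace" maps them to U+FFFD,
        # so they are also reported as differing from latin-1.
        as_cp1252 = single.decode("cp1252", errors="replace")
        if as_latin1 != as_cp1252:
            found.append(value)
    return found


if __name__ == "__main__":
    sample = b"caf\xe9 \x93quoted\x94"
    print([hex(v) for v in differing_bytes(sample)])
```

A conformance checker could use a test of this kind to flag documents labelled ISO-8859-1 that rely on the windows-1252 interpretation; whether that should be an error is the open question in the quoted comment.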