This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019. Please see the home page for more details.

Bug 21088 - Spec repeats potential Gecko bugs about encoding defaults as the truth
Status: RESOLVED FIXED
Alias: None
Product: HTML WG
Classification: Unclassified
Component: HTML5 spec
Version: unspecified
Hardware: PC Linux
Importance: P2 normal
Target Milestone: ---
Assignee: Robin Berjon
QA Contact: HTML WG Bugzilla archive list
URL:
Whiteboard:
Keywords: CR
Depends on: 21087
Blocks:
Reported: 2013-02-22 13:18 UTC by Henri Sivonen
Modified: 2013-09-16 22:58 UTC

Description Henri Sivonen 2013-02-22 13:18:38 UTC
See http://www.w3.org/html/wg/drafts/html/CR/syntax.html#determining-the-character-encoding

+++ This bug was initially created as a clone of Bug #21087 +++

The spec includes a table of locales and encoding defaults for those locales. The data for that table has been taken from Gecko 1.9.1 source code. It appears that the data hasn't been properly compared with the behavior of IE, which might have more significant market share in some of the locales involved. In particular, it looks suspicious that the Simplified Chinese default is GB18030 rather than GBK, and every entry that suggests UTF-8 as the encoding looks suspicious. For example, chances are that users of the Welsh UI will be exposed to the same legacy content as users of the UK English UI. Also, Windows has a legacy code page specifically for Vietnamese, so it seems incredible that legacy content encountered by users of the Vietnamese locale would more often be UTF-8 than that code page.

In order to avoid spreading bugs, please remove all the entries that haven't been cross-checked to agree with the defaults of a version of Internet Explorer that predates the inclusion of the table in the spec. If such cross-checking can be performed in a timely manner, please at least remove all the entries that claim that the default should be UTF-8 or GB18030 for the time being.
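
For illustration only, the interim removal requested above could be sketched as a filter over a locale-to-encoding table. The table contents, the locale tags, and the function name are hypothetical; only the Simplified Chinese, Welsh, and Vietnamese cases come from this report, and the UK English entry is an assumed example of an entry that would be kept. This is not the spec's actual data.

```python
# Hypothetical excerpt of a locale -> default-encoding table.
# Entries are illustrative, not copied from the spec or from Gecko.
SPEC_TABLE = {
    "zh-CN": "gb18030",       # questioned in this report (GBK expected)
    "cy": "utf-8",            # Welsh: questioned in this report
    "vi": "utf-8",            # Vietnamese: questioned in this report
    "en-GB": "windows-1252",  # assumed entry that is not in dispute
}

# Defaults the report asks to drop until cross-checked against IE.
SUSPECT_DEFAULTS = {"utf-8", "gb18030"}


def prune_suspect_defaults(table):
    """Return a copy of the table without entries whose default is suspect."""
    return {
        locale: encoding
        for locale, encoding in table.items()
        if encoding.lower() not in SUSPECT_DEFAULTS
    }


pruned = prune_suspect_defaults(SPEC_TABLE)
```

Under these assumed entries, only the UK English row would survive until the remaining rows are verified.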
Comment 1 Henri Sivonen 2013-02-22 13:20:54 UTC
If such cross-checking *can't*…
Comment 2 Travis Leithead [MSFT] 2013-09-16 22:53:08 UTC
A fix for this was checked in last June by Ian:

https://github.com/w3c/html/commit/6ec65943fbf4991518832455b354d2aaddf082a1
Comment 3 Travis Leithead [MSFT] 2013-09-16 22:58:03 UTC
Confirmed that Ian's change was integrated into CR in Robin's Mega-merge.