This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019.

Bug 21087 - Spec repeats potential Gecko bugs about encoding defaults as the truth
Summary: Spec repeats potential Gecko bugs about encoding defaults as the truth
Status: RESOLVED FIXED
Alias: None
Product: WHATWG
Classification: Unclassified
Component: HTML
Version: unspecified
Hardware: PC Linux
Importance: P2 normal
Target Milestone: Unsorted
Assignee: Ian 'Hixie' Hickson
QA Contact: contributor
URL:
Whiteboard:
Keywords:
Depends on:
Blocks: 21088
Reported: 2013-02-22 13:15 UTC by Henri Sivonen
Modified: 2013-06-13 03:12 UTC (History)
CC List: 3 users

Description Henri Sivonen 2013-02-22 13:15:58 UTC
The spec includes a table of locales and encoding defaults for those locales. The data for that table has been taken from Gecko 1.9.1 source code. It appears that the data hasn't been properly compared with the behavior of IE, which might have more significant market share in some of the locales involved. In particular, it looks suspicious that the Simplified Chinese default is GB18030 rather than GBK, and every entry that suggests UTF-8 as the encoding looks suspicious. For example, chances are that users of Welsh UI will be exposed to the same legacy content as the users of UK English UI. Also, Windows has a legacy code page specifically for Vietnamese, so it seems incredible that legacy content encountered by users of the Vietnamese locale would more often be UTF-8 than that code page.

In order to avoid spreading bugs, please remove all the entries that haven't been cross-checked to agree with the defaults of a version of Internet Explorer that predates the inclusion of the table in the spec. If such cross-checking can be performed in a timely manner, please at least remove all the entries that claim that the default should be UTF-8 or GB18030 for the time being.
Comment 1 Henri Sivonen 2013-02-22 13:20:22 UTC
If such cross-checking *can't*…
Comment 2 Ian 'Hixie' Hickson 2013-04-12 20:12:03 UTC
I have no idea how to perform such cross-checking. If I were to remove everything that hasn't been cross-checked, I'd just remove the table.

Removing entries that say UTF-8 wouldn't really do anything since it would just leave that locale undefined, and we know at least one UA does use UTF-8.

I spot-checked some of the locales where there's an easy correlation between language and country and where the encoding in the table was UTF-8, and the ones I looked at all had Firefox at a higher market share than IE's market share, FWIW.

This page gives Chrome's defaults; I guess I could use this to fill out the table some more and cross-check those locales that are in both Firefox and Chrome?:

https://code.google.com/p/chromium/codesearch#search/&q=IDS_DEFAULT_ENCODING
Comment 3 Henri Sivonen 2013-04-19 14:14:57 UTC
You'll find that Firefox trunk, Chrome and IE now agree for Simplified Chinese and disagree with the spec. Also, you'll find that Firefox is confused for Czech and disagrees with itself across platforms (as if the purpose of the fallback was dealing with local legacy text files as opposed to legacy Web content). For Vietnamese, if I had to bet, I'd bet for IE agreeing with Chrome rather than the spec and Firefox. Still Firefox seems to have a higher market share than IE, but Chrome seems to have an even higher market share in Vietnam.

I'd rather you remove the table than keep distributing faulty data as spec.
Comment 4 Ian 'Hixie' Hickson 2013-04-26 23:17:58 UTC
Well we clearly need a table, since browsers don't just default to UTF-8, however much we'd like them to.

But I agree that it would make sense to only include rows that have interoperability. Since we have access to the Chrome and Firefox data, should we just have the table include rows that are the same in both, and leave the others undefined?
Comment 5 Peter Occil 2013-04-27 18:40:58 UTC
I think this may be helpful. It shows which encoding is in use in different locales in Windows 7 (under the column "ANSI codepage").

http://msdn.microsoft.com/en-us/goglobal/bb896001

I don't know if Internet Explorer applies these encoding defaults, but the table may still be useful in any case. Note that the "ANSI codepage" is a number, not a name, that corresponds to a Windows codepage (for example, 1252 corresponds to "windows-1252").  The number will be 0 if there is no corresponding Windows codepage.
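
As a rough illustration of the number-to-name convention described here, the following Python sketch turns an "ANSI codepage" value into a label. The function name `codepage_to_label` is hypothetical, and the plain `windows-<number>` pattern is an assumption that simply follows the 1252 example; it does not check whether the result is a name that any browser or the Encoding Standard actually recognises.

```python
from typing import Optional


def codepage_to_label(ansi_codepage: int) -> Optional[str]:
    """Turn a Windows "ANSI codepage" number into a windows-NNNN style label.

    Returns None for 0, the value the table uses when a locale has no
    corresponding Windows codepage.
    """
    if ansi_codepage == 0:
        return None
    # Assumption: the label is just the number with a "windows-" prefix.
    return f"windows-{ansi_codepage}"


print(codepage_to_label(1252))  # windows-1252
print(codepage_to_label(0))     # None
```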
Comment 6 Henri Sivonen 2013-05-24 09:32:01 UTC
If Firefox, Chrome and the table from comment 5 all agree, then it's virtually certain that an entry is right. Otherwise, probably better to put some kind of "research needed" placeholder in the table than to claim a particular encoding.

As a general heuristic, the legacy Windows code page is typically right, but the market share of Firefox in Poland and some nearby countries defies the default heuristic.
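
The agreement rule in the first paragraph of this comment could be sketched roughly as follows, in Python. This is only an illustration: the function name `agreed_default`, the `RESEARCH_NEEDED` placeholder string, and the sample inputs are all hypothetical, and comparing names case-insensitively is an assumption (it does not resolve aliases of the same encoding).

```python
from typing import Optional

# Hypothetical placeholder text for rows that still need investigation.
RESEARCH_NEEDED = "research needed"


def agreed_default(firefox: Optional[str],
                   chrome: Optional[str],
                   windows: Optional[str]) -> str:
    """Return the encoding only if all three sources name the same one."""
    sources = [firefox, chrome, windows]
    if any(s is None for s in sources):
        # A missing data point cannot count as agreement.
        return RESEARCH_NEEDED
    normalized = {s.strip().lower() for s in sources}
    if len(normalized) == 1:
        return normalized.pop()
    return RESEARCH_NEEDED


# Sample inputs, made up for illustration only.
print(agreed_default("windows-1251", "windows-1251", "windows-1251"))
print(agreed_default("ISO-8859-2", "windows-1250", "windows-1250"))
```

Under this sketch, two sources out of three agreeing is still treated as "research needed", which is stricter than a majority vote.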
Comment 7 Peter Occil 2013-05-26 16:47:10 UTC
At least currently,  Firefox uses the GetLocaleInfoW function to retrieve the locale's encoding on Windows [1], which means that the results will likely be the same as the table in Comment 5 (which, it seems, also uses GetLocaleInfoW as its source.) However, Firefox always uses UTF-8 as the fallback encoding for Mac OS and Android [2][3], and uses nl_langinfo and a now-deprecated character-encoding mapping for Linux, etc. [4]

[1]: http://mxr.mozilla.org/mozilla-release/source/intl/locale/src/windows/nsWinCharset.cpp
[2]: http://mxr.mozilla.org/mozilla-release/source/intl/locale/src/mac/nsMacCharset.cpp
[3]: http://mxr.mozilla.org/mozilla-release/source/intl/locale/src/unix/nsAndroidCharset.cpp
[4]: http://mxr.mozilla.org/mozilla-release/source/intl/locale/src/unix/unixcharset.properties
Comment 8 Henri Sivonen 2013-05-27 06:29:02 UTC
AFAICT, that code isn't used for the Web but is used for dealing with local file names.
Comment 9 Peter Occil 2013-05-27 12:32:49 UTC
For HTML documents, Firefox actually uses the value of the preference "intl.charset.default" (which corresponds to the Default Character Encoding dropdown box) as a fallback encoding.  The preference's initial value depends on the browser's language.  For example, the initial value of "intl.charset.default" for the Japanese version is Shift_JIS [1].  The preference is used in the TryWeakDocTypeDefault method in nsHTMLDocument; however, if the encoding isn't an ASCII-compatible character encoding, the value "windows-1252"  is used instead [2].

[1]: http://hg.mozilla.org/releases/l10n/mozilla-release/ja/file/55fc0e0c4712/toolkit/chrome/global/intl.properties
[2]: http://mxr.mozilla.org/mozilla-release/source/content/html/document/src/nsHTMLDocument.cpp
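
The fallback behaviour described above can be paraphrased as a small Python sketch. It is not a transcription of the Gecko code: the function name `weak_default_encoding` and the contents of `ASCII_INCOMPATIBLE` are assumptions chosen for illustration, and the real check in nsHTMLDocument may differ.

```python
# Assumption: a small illustrative set of encodings that are not
# ASCII-compatible. The actual list used by Gecko is not given in this thread.
ASCII_INCOMPATIBLE = frozenset({
    "utf-16", "utf-16le", "utf-16be",
    "utf-32", "utf-32le", "utf-32be",
})


def weak_default_encoding(pref_value: str) -> str:
    """Pick the fallback encoding from the "intl.charset.default" pref value.

    If the preference names an encoding that is not ASCII-compatible,
    fall back to windows-1252 instead.
    """
    if pref_value.strip().lower() in ASCII_INCOMPATIBLE:
        return "windows-1252"
    return pref_value


print(weak_default_encoding("Shift_JIS"))  # Shift_JIS
print(weak_default_encoding("UTF-16LE"))   # windows-1252
```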
Comment 10 Ian 'Hixie' Hickson 2013-06-12 00:40:42 UTC
I've been working on this. Here's the problematic ones (ones for which I have no idea what the right choice should be):

Locale     Description     Vista           Chrome          Spec/Firefox
ro         Romanian        windows-1250    ISO-8859-2      windows-1252
cs         Czech           windows-1250    windows-1250    ISO-8859-2
hu         Hungarian       windows-1250    ISO-8859-2      ISO-8859-2
lv         Latvian         windows-1257    windows-1257    ISO-8859-13
sl         Slovenian       windows-1250    ISO-8859-2      ISO-8859-2
pl         Polish          windows-1250    ISO-8859-2      ISO-8859-2
be         Belarusian      windows-1251    <none>          ISO-8859-5
el         Greek           windows-1253    ISO-8859-7      windows-1252

Any suggestions for those?
Comment 11 Ian 'Hixie' Hickson 2013-06-12 04:49:00 UTC
I tentatively went with majority rule on those. I tried to document the decisions in each case for the locales now listed in the spec. Let me know if you still think any of them should be removed. None of the remaining ones default to UTF-8 (e.g. Welsh is now the same as the UK, because nobody else cared about 'cy', and Vista said 'cy-GB' was win1252). I treated Windows-936, GBK, and GB18030 as basically the same encoding, since each is a superset of the previous one for most intents and purposes, so zh-CN uses it (in practice, Windows Vista wanted to use win936, Chrome wanted to use GBK, and Gecko wanted to use GB18030). That's based on what Wikipedia says; I do note that encoding.spec.whatwg.org does treat the latter two as distinct (though as far as I can tell, it just treats GB18030 as a superset since it decodes it using the GBK algorithm with a flag set that just allows more codepoints, or something). If there's a reason to use GBK instead, we can use that, let me know (why does that encoding even exist? I couldn't find any discussion of it other than Wikipedia which says it's a subset, as mentioned above). Vietnamese uses win1258.
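
For reference, a majority vote over the three columns of the table in comment 10 could be computed with a short Python sketch like the one below. The row data is copied from that table; the names `ROWS` and `majority` are hypothetical, and treating `<none>` as a missing vote is an assumption. The thread does not say how rows without a two-out-of-three majority were settled, so the sketch only reports them.

```python
from collections import Counter
from typing import Iterable, Optional

# (Vista, Chrome, Spec/Firefox) per locale, copied from comment 10.
ROWS = {
    "ro": ("windows-1250", "ISO-8859-2", "windows-1252"),
    "cs": ("windows-1250", "windows-1250", "ISO-8859-2"),
    "hu": ("windows-1250", "ISO-8859-2", "ISO-8859-2"),
    "lv": ("windows-1257", "windows-1257", "ISO-8859-13"),
    "sl": ("windows-1250", "ISO-8859-2", "ISO-8859-2"),
    "pl": ("windows-1250", "ISO-8859-2", "ISO-8859-2"),
    "be": ("windows-1251", None, "ISO-8859-5"),  # Chrome column is <none>
    "el": ("windows-1253", "ISO-8859-7", "windows-1252"),
}


def majority(values: Iterable[Optional[str]]) -> Optional[str]:
    """Return the encoding named by at least two sources, else None."""
    counts = Counter(v for v in values if v is not None)
    if not counts:
        return None
    name, votes = counts.most_common(1)[0]
    return name if votes >= 2 else None


for locale, values in ROWS.items():
    print(locale, majority(values) or "no majority")
```

With this data, cs and lv come out as their Windows code pages, hu, sl and pl come out as ISO-8859-2, and ro, be and el have no majority.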
Comment 12 contributor 2013-06-12 04:49:07 UTC
Checked in as WHATWG revision r7958.
Check-in comment: New encoding defaults based on more data.
http://html5.org/tools/web-apps-tracker?from=7957&to=7958
Comment 13 Henri Sivonen 2013-06-12 07:48:34 UTC
(In reply to comment #11)
> I tentatively went with majority rule on those. I tried to document the
> decisions in each case for the locales now listed in the spec. Let me know
> if you still think any of them should be removed. None of the remaining ones
> default to UTF-8 (e.g. Welsh is now the same as the UK, because nobody else
> cared about 'cy', and Vista said 'cy-GB' was win1252).

Thanks. 

(FWIW, in my testing, en-US Windows 7 base install plus Welsh Windows 7 language pack plus Welsh IE9 resulted in a completely bogus default: koi8-r. I think it makes sense to make Welsh default to Windows-1252. The maintainer of the Firefox localization seems to disagree and I have higher-priority stuff to work on.)

> I treated
> Windows-936, GBK, and GB18030 as basically the same encoding, since each is
> a superset of the previous one for most intents and purposes, so zh-CN uses
> it (in practice, Windows Vista wanted to use win936, Chrome wanted to use
> GBK, and Gecko wanted to use GB18030).

Gecko defaults to GBK on trunk and treats GBK and GB18030 as distinct as does The Encoding Standard. Is GB18030 really a superset of GBK? (I gather both GBK and GB18030 are supersets of GB2312.)

> That's based on what Wikipedia says;
> I do note that encoding.spec.whatwg.org does treat the latter two as
> distinct (though as far as I can tell, it just treats GB18030 as a superset
> since it decodes it using the GBK algorithm with a flag set that just allows
> more codepoints, or something). If there's a reason to use GBK instead, we
> can use that, let me know (why does that encoding even exist?

Gecko switched to GBK for consistency with IE, which has more market share than Firefox in China.

I wonder if users in Hong Kong and Singapore use Big5-defaulting browsers, too. At least Firefox zh-TW is marketed as “Chinese (Traditional)”. Maybe the spec should use the Traditional Chinese & Simplified Chinese taxonomy instead of a regional taxonomy. Gecko's internal use of zh-CN and zh-TW predates the IANA registration of zh-Hans and zh-Hant.
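
On the superset question raised earlier in this comment, a quick experiment with Python's built-in `gbk` and `gb18030` codecs shows one way in which the two differ. This is only a sketch: Python's codecs are not the browsers' decoders or the Encoding Standard's definitions, the helper name `encodable` is hypothetical, and the check covers only which characters can be encoded. It does not settle whether every GBK byte sequence decodes identically under GB18030.

```python
def encodable(text: str, encoding: str) -> bool:
    """Return True if `text` can be encoded with the given Python codec."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


common = "\u4e2d"      # U+4E2D, a very common CJK character
rare = "\U00020000"    # U+20000, outside the Basic Multilingual Plane

# The common character gets the same two bytes in both codecs.
print(common.encode("gbk") == common.encode("gb18030"))

# The supplementary-plane character needs a four-byte GB18030 sequence
# and has no representation in GBK.
print(encodable(rare, "gb18030"), len(rare.encode("gb18030")))
print(encodable(rare, "gbk"))
```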
Comment 14 Ian 'Hixie' Hickson 2013-06-12 16:36:15 UTC
(In reply to comment #13)
> 
> Gecko defaults to GBK on trunk and treats GBK and GB18030 as distinct as
> does The Encoding Standard. Is GB18030 really a superset of GBK? (I gather
> both GBK and GB18030 are supersets of GB2312.)

I'll talk to Anne.


> I wonder if users in Hong Kong and Singapore use Big5-defaulting browsers,
> too. At least Firefox zh-TW is marketed as “Chinese (Traditional)”. Maybe
> the spec should use the Traditional Chinese & Simplified Chinese taxonomy
> instead of a regional taxonomy. Gecko's internal use of zh-CN and zh-TW
> predates the IANA registration of zh-Hans and zh-Hant.

The names I used in the spec come straight from the Vista page cited above. I don't know enough to know what the best answer is here.
Comment 15 Ian 'Hixie' Hickson 2013-06-13 02:23:41 UTC
Talked to Anne, who pointed me to bug 16862; GBK/GB18030 are still in flux. I'm going to call it in favour of GB18030 for now, and if that causes a concrete compat issue with deployed content, we can revisit it (in a new bug).

For the new names I'm leaving them as is until someone gets offended, then I'll change it until someone gets offended again, and I'll just keep flipping those back and forth forever. It's not normative, after all. :-)

Thanks for prodding me to do this work, the spec is definitely the better for it.
Comment 16 Peter Occil 2013-06-13 03:12:07 UTC
I should note that "windows-949" in the table should be "euc-kr" instead, which is the preferred name in the Encoding Standard.