This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019. Please see the home page for more details.

Bug 23089 - Incomplete coverage of Cyrillic languages that should imply a windows-1251 fallback
Status: RESOLVED FIXED
Alias: None
Product: WHATWG
Classification: Unclassified
Component: HTML
Version: unspecified
Hardware: All All
Importance: P2 normal
Target Milestone: Unsorted
Assignee: Ian 'Hixie' Hickson
QA Contact: contributor
Reported: 2013-08-29 11:51 UTC by Henri Sivonen
Modified: 2013-11-06 21:38 UTC

Description Henri Sivonen 2013-08-29 11:51:34 UTC
http://www.whatwg.org/specs/web-apps/current-work/multipage/parsing.html#determining-the-character-encoding is incomplete in its coverage of Cyrillic languages.

be: The Belarusian localization of Firefox has windows-1251 as the fallback, so it's virtually certain that the spec should require this.

kk: The Kazakh localization of Firefox currently has UTF-8 as the fallback, and we have telemetry data indicating that it is a bad fallback, so it's virtually certain that the spec should require a windows-1251 fallback for Kazakh.

Considering the Windows code page legacy and, in some cases, the relationship with Russia, it's reasonable to guess that the following should *probably* also fall back to windows-1251:
ba (Bashkir)
ky (Kyrgyz)
mk (Macedonian)
tg (Tajik)
tt (Tatar)
sah (Yakut)

Probably best to check this latter list with someone who actually knows.
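
The proposal above amounts to a locale-to-fallback lookup. Below is a minimal sketch of that idea, not the spec's actual algorithm. The names `WINDOWS_1251_LOCALES` and `fallback_encoding` are hypothetical, matching on the primary language subtag only is a simplification, and the windows-1252 default for unlisted locales is an assumption of this sketch.

```python
# Hypothetical table covering only the locales discussed in this bug.
# "be" and "kk" are the ones described as virtually certain; the rest
# are the "probably" list that still needs checking.
WINDOWS_1251_LOCALES = {"be", "kk", "ba", "ky", "mk", "tg", "tt", "sah"}


def fallback_encoding(locale: str, default: str = "windows-1252") -> str:
    """Return the assumed legacy fallback encoding for a locale tag.

    Only the primary language subtag is examined, so "sah-RU" and
    "tg-Cyrl-TJ" are treated the same as "sah" and "tg".
    """
    primary = locale.replace("_", "-").split("-")[0].lower()
    if primary in WINDOWS_1251_LOCALES:
        return "windows-1251"
    return default


if __name__ == "__main__":
    for tag in ("be", "kk", "sah-RU", "tg-Cyrl-TJ", "en-US"):
        print(tag, "->", fallback_encoding(tag))
```

A real table would also carry every other locale the spec lists, and might need to distinguish script or region subtags rather than collapsing them.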
Comment 1 Ian 'Hixie' Hickson 2013-08-30 18:06:41 UTC
The current requirements are from bug 21087, where you said "In order to avoid spreading bugs, please remove all the entries that haven't been cross-checked to agree with the defaults of a version of Internet Explorer that predates the inclusion of the table in the spec". What changed?

These are the notes I have in the spec for those locales:

<!-- be, Belarusian, is not listed here because Windows Vista wanted windows-1251, Chrome wanted <none>, and Firefox wanted ISO-8859-5 -->
<!-- ba-RU, Bashkir (Russia), is not listed here because neither Chrome nor Firefox knew about it. For what it's worth, Windows Vista wanted windows-1251 -->
<!-- ky, Kyrgyz, is not listed here because neither Chrome nor Firefox knew about it. For what it's worth, Windows Vista wanted windows-1251 -->
<!-- mk, Macedonian, is not listed here because neither Chrome nor Firefox knew about it. For what it's worth, Windows Vista wanted windows-1251 -->
<!-- tg-Cyrl-TJ, Tajik (Cyrillic, Tajikistan), is not listed here because neither Chrome nor Firefox knew about it. For what it's worth, Windows Vista wanted windows-1251 -->
<!-- tt, Tatar, is not listed here because neither Chrome nor Firefox knew about it. For what it's worth, Windows Vista wanted windows-1251 -->
<!-- sah-RU, Yakut (Russia), is not listed here because neither Chrome nor Firefox knew about it. For what it's worth, Windows Vista wanted windows-1251 -->
Comment 2 Henri Sivonen 2013-09-06 07:36:12 UTC
> The current requirements are from bug 21087, where you said "In order to 
> avoid spreading bugs, please remove all the entries that haven't been 
> cross-checked to agree with the defaults of a version of Internet 
> Explorer that predates the inclusion of the table in the spec". What 
> changed?

Doesn't windows-1251 agree with IE?

> <!-- be, Belarusian, is not listed here because Windows Vista wanted windows-1251, Chrome wanted <none>, and Firefox wanted ISO-8859-5 -->

https://mxr.mozilla.org/l10n-mozilla-release/search?string=intl.charset.default&find=intl.properties says windows-1251 in Firefox.

> <!-- ba-RU, Bashkir (Russia), is not listed here because neither Chrome nor Firefox knew about it. For what it's worth, Windows Vista wanted windows-1251 -->
> <!-- tt, Tatar, is not listed here because neither Chrome nor Firefox knew about it. For what it's worth, Windows Vista wanted windows-1251 -->
> <!-- sah-RU, Yakut (Russia), is not listed here because neither Chrome nor Firefox knew about it. For what it's worth, Windows Vista wanted windows-1251 -->

These are minority languages of Russia that use the Cyrillic script. It seems reasonable to expect that their users often have to browse ru-RU content, and it would be strange for the Cyrillic legacy of these languages in Russia to differ from the legacy of ru-RU.

> <!-- mk, Macedonian, is not listed here because neither Chrome nor Firefox knew about it. For what it's worth, Windows Vista wanted windows-1251 -->

Firefox now has a localization for mk which sets UTF-8 as the fallback. For the obvious reasons, I find it *extremely* hard to believe that UTF-8 is the right answer.
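
One plausible reading of the objection is that unlabeled legacy Cyrillic pages are stored in a single-byte encoding, so decoding them as UTF-8 produces errors rather than text. The sketch below illustrates this with a sample word. The helper `decodes_cleanly` and the sample string are hypothetical and not taken from any browser's code.

```python
def decodes_cleanly(data: bytes, encoding: str) -> bool:
    """Return True if data decodes in the given encoding without errors."""
    try:
        data.decode(encoding)
    except UnicodeDecodeError:
        return False
    return True


# Sample text standing in for an unlabeled legacy page (an assumption).
text = "Македонски"
legacy_bytes = text.encode("windows-1251")

# With the windows-1251 fallback, the original text is recovered.
roundtrip = legacy_bytes.decode("windows-1251")

# With a UTF-8 fallback, the bytes are not valid UTF-8, so a lenient
# decoder substitutes U+FFFD replacement characters.
garbled = legacy_bytes.decode("utf-8", errors="replace")

if __name__ == "__main__":
    print(decodes_cleanly(legacy_bytes, "windows-1251"))
    print(decodes_cleanly(legacy_bytes, "utf-8"))
    print(roundtrip)
    print(garbled)
```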
Comment 3 Ian 'Hixie' Hickson 2013-11-06 21:38:02 UTC
Ok, seems reasonable.
Comment 4 contributor 2013-11-06 21:38:43 UTC
Checked in as WHATWG revision r8258.
Check-in comment: Add some more locales to the default encoding logic.
http://html5.org/tools/web-apps-tracker?from=8257&to=8258