This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019.

Bug 17053 - Support KOI8-RU mapping for KOI8-U
Status: RESOLVED FIXED
Alias: None
Product: WHATWG
Classification: Unclassified
Component: Encoding
Version: unspecified
Hardware: All Windows 3.1
Importance: P2 normal
Target Milestone: Unsorted
Assignee: Anne
QA Contact: sideshowbarker+encodingspec
Reported: 2012-05-14 22:23 UTC by pub-w3
Modified: 2015-08-19 13:36 UTC (History)

Description pub-w3 2012-05-14 22:23:43 UTC
IE takes the labels koi8-u and koi8-ru to mean KOI8-RU and not KOI8-U.  The difference is that KOI8-RU has an additional letter needed for Byelorussian,

AE:  U+045E  (ў) and
BE:  U+040E  (Ў),

where KOI8-U has line-drawing characters,

AE:  U+255D  (╝) and
BE:  U+256C  (╬).

Letters are arguably more important than box-drawing characters, so KOI8-RU might be a better choice than KOI8-U, at least if it can be shown that koi8-(r)u is used for Byelorussian (i.e., that AE/BE are used to encode ў/Ў).
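
The two-byte difference described above can be sketched in Python. This is only an illustration under stated assumptions, not the way any browser implements it: it relies on Python's built-in koi8_u codec for every byte other than AE and BE, and the names KOI8_RU_OVERRIDES and decode_koi8_ru are hypothetical, introduced here for the example.

```python
# Hypothetical helper: decode bytes as KOI8-U, except for the two bytes
# that KOI8-RU assigns to Byelorussian letters instead of box drawing.
KOI8_RU_OVERRIDES = {
    0xAE: "\u045e",  # ў (KOI8-U has U+255D here)
    0xBE: "\u040e",  # Ў (KOI8-U has U+256C here)
}


def decode_koi8_ru(data: bytes) -> str:
    """Decode byte by byte, applying the two KOI8-RU overrides."""
    chars = []
    for byte in data:
        if byte in KOI8_RU_OVERRIDES:
            chars.append(KOI8_RU_OVERRIDES[byte])
        else:
            # Assumption: the stdlib koi8_u codec covers all other bytes.
            chars.append(bytes([byte]).decode("koi8_u"))
    return "".join(chars)
```

Calling decode_koi8_ru(b"\xae\xbe") yields the two letters ў and Ў. Every other byte decodes exactly as the koi8_u codec decodes it.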
Comment 1 Alexey Proskuryakov 2012-10-11 20:17:58 UTC
KOI8-RU is not one of the aliases supported by ICU (see <http://demo.icu-project.org/icu-bin/convexp>). Is the encoding itself supported by ICU? Having such back-end support would be useful for getting it supported in browsers.

I don't have any data on real life use of this encoding.
Comment 2 pub-w3 2012-10-22 21:30:40 UTC
(In reply to comment #1)
> Is [KOI8-RU] supported by ICU?

Apparently not:

$ grep '042F.*F1' mappings/* 
mappings/ibm-1168_P100-2002.ucm:<U042F> \xF1 |0
mappings/ibm-878_P100-1996.ucm:<U042F> \xF1 |0

All KOI8 encodings encode the basic modern Russian letters identically.  In particular, Я (U+042F) is encoded as 0xF1.  Only the mapping files for KOI8-R (IBM-878) and KOI8-U (IBM-1168) match, so KOI8-RU is not supported.

> I don't have any data on real life use of this encoding.

Have you looked for 0xAE bytes in data labelled KOI8-U (or possibly KOI8-R)?
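
A check along those lines could be sketched as follows. This is a minimal, hypothetical example: count_disputed_bytes and DISPUTED_BYTES are names made up here, and the input is assumed to be the raw bytes of a document already known to be labelled KOI8-U or KOI8-R.

```python
from collections import Counter

# The two byte values whose meaning differs between KOI8-U and KOI8-RU.
DISPUTED_BYTES = (0xAE, 0xBE)


def count_disputed_bytes(data: bytes) -> dict:
    """Return how often each disputed byte occurs in the raw data."""
    counts = Counter(data)  # iterating over bytes yields integers
    return {byte: counts[byte] for byte in DISPUTED_BYTES}
```

A nonzero count shows only that the bytes occur. Deciding whether they stand for ў/Ў or for box-drawing characters would still require looking at the surrounding text.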
Comment 3 Anne 2012-11-16 14:32:21 UTC
Does IE also report koi8-ru as the encoding name (via the DOM)? I suppose if IE does this it might be more compatible, although IE is not dominant in that region (afaik).
Comment 4 pub-w3 2012-12-08 15:23:29 UTC
(In reply to comment #3)
> Does IE also report koi8-ru as the encoding name (via the DOM)?

document.charset returns koi8-u in IE9.

The encoding vector appears to have been changed from KOI8-U to KOI8-RU at some point between IE6 and IE9.  I assume this would not have happened in the absence of KOI8-RU content labelled as KOI8-U, but this may not be an issue for current Web content.
Comment 5 Anne 2013-09-04 09:32:28 UTC
Adrian, Travis, any idea here?
Comment 6 Travis Leithead [MSFT] 2013-09-04 17:58:51 UTC
I will need to have an encoding expert on our team look into this; offhand I don't know how prevalent this encoding is or why this change may have been made.
Comment 7 Travis Leithead [MSFT] 2013-09-12 17:32:22 UTC
We've searched IE's code base, and found that we've had this behavior since at least IE4. I can't prove it at the moment, but I suspect that what we're seeing here is an encoding compatibility decision that was made to align with Netscape at the time.

Due to the longevity of this behavior, I'm not very keen on changing it unless you can prove a significant web compatibility problem with it.
Comment 8 Anne 2013-12-11 16:32:38 UTC
It seems koi8-r also removes a few letters in favor of line-drawing characters. I wonder if just supporting koi8-ru would be sufficient.
Comment 9 Anne 2014-11-04 15:07:57 UTC
Simon, Jungshik, Joshua, Henri, last year Travis expressed disinterest in changing Internet Explorer for this encoding. Are Chromium and Gecko willing to change their implementation to match Internet Explorer?

Comment 0 describes the minor difference between the mapping in browsers.
Comment 10 Anne 2015-08-19 11:15:12 UTC
Feel free to reopen this once someone can address the question in comment 9. Long live the status quo of the majority...
Comment 11 Jungshik Shin 2015-08-19 12:33:47 UTC
Hmm... I missed this bug. Without doing any research, but purely based on comment 0 and comment 7, I don't see a big issue with changing those two (0xAE, 0xBE). I'm not sure it's worthwhile to get to the bottom of it (data collection, etc.), as it gets less significant as time goes on.
Comment 12 Anne 2015-08-19 12:57:55 UTC
Alright, let's change it then. IE's behavior does seem slightly better.
Comment 13 Anne 2015-08-19 13:36:56 UTC
I did not change the name of the encoding per comment 4.

https://github.com/whatwg/encoding/commit/52f08a6259d331197685c6b417ee753b817c5a79