This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019. Please see the home page for more details.

Bug 27868 - EUC-KR and encoding-only mapping (fromUnicode)
Status: RESOLVED INVALID
Alias: None
Product: WHATWG
Classification: Unclassified
Component: Encoding
Version: unspecified
Hardware: PC All
Importance: P2 normal
Target Milestone: Unsorted
Assignee: Anne
QA Contact: sideshowbarker+encodingspec
 
Reported: 2015-01-20 18:54 UTC by Jungshik Shin
Modified: 2015-08-21 09:58 UTC

Attachments
ICU's windows-949 : decoding only entries (7.87 KB, text/plain)
2015-01-20 19:04 UTC, Jungshik Shin

Description Jungshik Shin 2015-01-20 18:54:26 UTC
When I compared the mapping of EUC-KR in the encoding spec with ICU's Windows-949 [1] (which was obtained by scraping *one of Windows' converters*), I found the following differences:

1. ICU's Windows-949 mapping has 395 'decoding only' (from Unicode to windows-949) entries for characters such as currency signs (U+00A2, U+00A3), regular Latin/Greek/Cyrillic letters, Hangul conjoining jamo (U+11xx), halfwidth Hangul jamo (U+FFxx), enclosed CJK characters (e.g. U+32xx), etc.

2. ICU's Windows-949 has 190 additional round-trip mapping entries. Most of them (188) are for the two user-defined blocks in KS X 1001 (in EUC-KR, "C9 [A1-FE]" and "FE [A1-FE]"), which are mapped to PUA code points (U+E000 - U+E0BB). The remaining two are U+0080 and U+F8F7, mapped to 0x80 and 0xFF.
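
As a rough illustration of the user-defined blocks in item 2, the sketch below enumerates the two rows and assigns them to the PUA range. The linear assignment (lead byte 0xC9 first, then 0xFE, trail bytes in ascending order) is an assumption made for illustration and should be checked against the .ucm file in [1]; the function and constant names are hypothetical.

```python
# Assumption: the two user-defined rows are assigned to the PUA linearly,
# lead byte 0xC9 first and then 0xFE, trail bytes 0xA1..0xFE in order.
USER_DEFINED_LEADS = (0xC9, 0xFE)
TRAIL_FIRST, TRAIL_LAST = 0xA1, 0xFE
ROW_SIZE = TRAIL_LAST - TRAIL_FIRST + 1  # 94 trail bytes per row
PUA_BASE = 0xE000


def user_defined_to_pua(lead, trail):
    """Return the PUA code point for a user-defined byte pair, or None."""
    if lead not in USER_DEFINED_LEADS:
        return None
    if not (TRAIL_FIRST <= trail <= TRAIL_LAST):
        return None
    row = USER_DEFINED_LEADS.index(lead)
    return PUA_BASE + row * ROW_SIZE + (trail - TRAIL_FIRST)


pua_points = [
    user_defined_to_pua(lead, trail)
    for lead in USER_DEFINED_LEADS
    for trail in range(TRAIL_FIRST, TRAIL_LAST + 1)
]
print(len(pua_points), hex(pua_points[0]), hex(pua_points[-1]))
```

Two rows of 94 cells give the 188 entries mentioned above, and 188 consecutive code points starting at U+E000 end at U+E0BB.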

I don't think that we want to support the two user-defined blocks in KS X 1001. I'm not sure about U+0080 and U+F8F7. 

However, I believe that quite a few (NOT all) of the 'decoding only' entries should be supported.


[1] https://code.google.com/p/chromium/codesearch#chromium/src/third_party/icu/source/data/mappings/windows-949-2000.ucm&q=windows-949-2000.ucm&sq=package:chromium&type=cs
Comment 1 Jungshik Shin 2015-01-20 19:04:46 UTC
Created attachment 1565 [details]
ICU's windows-949 : decoding only entries
Comment 2 Anne 2015-01-21 09:59:37 UTC
If you go from Unicode to euc-kr, it is called encoding, not decoding. E.g. the stuff you need for <form> and URL.
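
To make the direction concrete, here is a minimal sketch using Python's built-in `euc_kr` codec. That codec is used only for illustration; it is not the Encoding Standard's index and may differ from it in edge cases.

```python
text = "한글"

# Encoding: Unicode text -> EUC-KR bytes (what <form> submission and URLs need).
encoded = text.encode("euc_kr")

# Decoding: EUC-KR bytes -> Unicode text (what reading a document needs).
decoded = encoded.decode("euc_kr")

print(encoded.hex(), decoded)
```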
Comment 3 Jungshik Shin 2015-01-21 11:23:38 UTC
You're absolutely right! I must have had more 'coffee' ;-)
Comment 4 Jungshik Shin 2015-01-21 11:24:23 UTC
The attachment title should be changed to 'encoding only entries'.

(In reply to Jungshik Shin from comment #1)
> Created attachment 1565 [details]
> ICU's windows-949 : decoding only entries

This should be 'ICU's windows-949 : encoding only entries'.
Comment 5 Anne 2015-08-19 12:17:21 UTC
So you attached 394 "encoding only" entries. How should I know which ones we want to add to the standard and which we want to ignore?
Comment 6 Anne 2015-08-21 09:28:14 UTC
I tested your attached code points.

Chrome and Firefox encode them as "HTML entities". The default error handling mode.

Safari has these 394 mappings.

Internet Explorer outputs "HTML entities" too; however, they're not always numeric and are sometimes named. This is truly bizarre.

Anyway, given these results, I don't think any changes are warranted here, as only Safari does what you suggest, but legacy content is far more likely to rely on what Internet Explorer does, which is pretty close to what Chrome, Firefox, and the Standard do (and often matches).

https://dump.testsuite.org/encoding/form-encoding-special-euc-kr.html
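
The "HTML entities" behavior can be sketched as follows: a code point with no entry in the encoder's index is written as a decimal numeric character reference. This is a simplified illustration, not any browser's implementation; `encode_html_mode` and `TOY_INDEX` are hypothetical names, and the one-entry index stands in for the real EUC-KR index.

```python
def encode_html_mode(text, index):
    """Encode text, writing unmappable code points as &#N; (decimal)."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp < 0x80:
            out.append(cp)  # ASCII passes through unchanged
        elif cp in index:
            out += index[cp]  # mapped code point: emit its byte sequence
        else:
            out += f"&#{cp};".encode("ascii")  # fallback: numeric reference
    return bytes(out)


# Hypothetical one-entry index: U+D55C (한) -> C7 D1.
TOY_INDEX = {0xD55C: b"\xc7\xd1"}

result = encode_html_mode("한 \u00a2", TOY_INDEX)
print(result)
```

With an index that lacks U+00A2, the cent sign comes out as `&#162;`, while an encoder that carries the extra mappings would emit EUC-KR bytes for it instead.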
Comment 7 Jungshik Shin 2015-08-21 09:58:34 UTC
Chrome used to behave like Safari until I changed its EUC-KR to use the current encoding spec. So, the following is a bit circular. 

> Chrome and Firefox encode them as "HTML entities". The default error handling mode.

Anyway, it's not terribly important.