27868 – EUC-KR and encoding-only mapping (fromUnicode)

This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019. Please see the home page for more details.

Status:	RESOLVED INVALID

Alias:	None

Product:	WHATWG
Classification:	Unclassified
Component:	Encoding
Version:	unspecified
Hardware:	PC All

Importance:	P2 normal
Target Milestone:	Unsorted
Assignee:	Anne
QA Contact:	sideshowbarker+encodingspec


Reported:	2015-01-20 18:54 UTC by Jungshik Shin
Modified:	2015-08-21 09:58 UTC
CC List:	4 users

Attachments
ICU's windows-949 : decoding only entries (7.87 KB, text/plain) 2015-01-20 19:04 UTC, Jungshik Shin

Description Jungshik Shin 2015-01-20 18:54:26 UTC

When I compared the mapping of EUC-KR in the encoding spec with ICU's Windows-949 [1] (which was obtained by scraping *one of Windows' converters*), I found the following differences:

1. ICU's Windows-949 mapping has 395 'decoding only' (from Unicode to windows-949) entries for characters such as the cent and pound signs (U+00A2, U+00A3), regular Latin/Greek/Cyrillic letters, Hangul conjoining jamo (U+11xx), halfwidth Hangul jamo (U+FFxx), enclosed CJK characters (e.g. U+32xx), etc.

2. ICU's Windows-949 has 190 additional round-trip mapping entries. Most of them (188) are for the two user-defined blocks in KS X 1001 (in EUC-KR, "C9 [A1-FE]" and "FE [A1-FE]") that are mapped to PUA code points (U+E000 - U+E0BB). The remaining two are U+0080 and U+F8F7, mapped to 0x80 and 0xFF respectively.
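
The arithmetic behind the 188 user-defined entries can be checked with a small sketch. It assumes the PUA code points are assigned sequentially, row 0xC9 first and then row 0xFE; the thread only gives the range U+E000 - U+E0BB, so the ordering and the function name `user_defined_to_pua` are illustrative assumptions rather than something taken from the ICU table.

```python
def user_defined_to_pua(lead: int, trail: int) -> int:
    """Map a byte pair from the two user-defined rows to a PUA code point.

    Assumption: code points are handed out sequentially, all of row 0xC9
    first, then all of row 0xFE.
    """
    rows = (0xC9, 0xFE)
    if lead not in rows or not 0xA1 <= trail <= 0xFE:
        raise ValueError("byte pair is not in a user-defined row")
    cells_per_row = 0xFE - 0xA1 + 1  # 94 trail bytes per row
    return 0xE000 + rows.index(lead) * cells_per_row + (trail - 0xA1)


# Enumerate both rows: 2 rows x 94 cells.
all_points = [
    user_defined_to_pua(lead, trail)
    for lead in (0xC9, 0xFE)
    for trail in range(0xA1, 0xFF)
]
print(len(all_points), hex(all_points[0]), hex(all_points[-1]))
```

Two rows of 94 cells give 188 code points, which is consistent with the range ending at U+E0BB.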

I don't think that we want to support the two user-defined blocks in KS X 1001. I'm not sure about U+0080 and U+F8F7. 

However, I believe that quite a few (NOT all) of the 'decoding only' entries should be supported.


[1] https://code.google.com/p/chromium/codesearch#chromium/src/third_party/icu/source/data/mappings/windows-949-2000.ucm&q=windows-949-2000.ucm&sq=package:chromium&type=cs
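
For readers who want to reproduce the count from a `.ucm` file, here is a minimal sketch. It relies on the ICU UCM convention that each mapping line ends in a precision flag, where `|0` marks a round-trip mapping and `|1` a fallback from Unicode to the code page only (the entries this bug is about). The `SAMPLE` lines are hand-written for illustration and are not copied from `windows-949-2000.ucm`; `parse_ucm` is a hypothetical helper name.

```python
import re

# One UCM mapping line looks like: <UXXXX> \xNN\xNN |F
UCM_LINE = re.compile(
    r"^<U([0-9A-Fa-f]{4,6})>\s+((?:\\x[0-9A-Fa-f]{2})+)\s+\|(\d)"
)


def parse_ucm(lines):
    """Yield (code_point, byte_string, precision_flag) for each mapping line."""
    for line in lines:
        match = UCM_LINE.match(line.strip())
        if not match:
            continue  # skip comments, headers, and blank lines
        code_point = int(match.group(1), 16)
        hex_bytes = re.findall(r"\\x([0-9A-Fa-f]{2})", match.group(2))
        raw = bytes(int(h, 16) for h in hex_bytes)
        yield code_point, raw, int(match.group(3))


# Illustrative sample only, not the real table.
SAMPLE = [
    r"<UAC00> \xB0\xA1 |0",
    r"<U00A2> \xA1\xCB |1",
    r"<U00A3> \xA1\xCC |1",
    "# a comment line",
]

# Flag 1: usable when going from Unicode to bytes, never produced in reverse.
encoding_only = [(cp, raw) for cp, raw, flag in parse_ucm(SAMPLE) if flag == 1]
print(len(encoding_only))
```

Running the same filter over the real file should give the number of one-way entries under discussion, assuming the file follows the flag convention described above.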

Comment 1 Jungshik Shin 2015-01-20 19:04:46 UTC

Created attachment 1565
ICU's windows-949 : decoding only entries

Comment 2 Anne 2015-01-21 09:59:37 UTC

If you go from Unicode to euc-kr, it is called encoding, not decoding. E.g. the stuff you need for <form> and URL.

Comment 3 Jungshik Shin 2015-01-21 11:23:38 UTC

You're absolutely right! I must have had more 'coffee' ;-)

Comment 4 Jungshik Shin 2015-01-21 11:24:23 UTC

The attachment title should be changed to 'encoding only entries'.

(In reply to Jungshik Shin from comment #1)
> Created attachment 1565
> ICU's windows-949 : decoding only entries

This should be 'ICU's windows-949 : encoding only entries'.

Comment 5 Anne 2015-08-19 12:17:21 UTC

So you attached 394 "encoding only" entries. How should I know which ones we want to add to the standard and which we want to ignore?

Comment 6 Anne 2015-08-21 09:28:14 UTC

I tested your attached code points.

Chrome and Firefox encode them as "HTML entities". The default error handling mode.

Safari has these 394 mappings.

Internet Explorer outputs "HTML entities" too, however, they're not always numeric, but are sometimes named. This is truly bizarre.

Anyway, given these results, I don't think any changes are warranted here, as only Safari does what you suggest, but legacy content is far more likely to rely on what Internet Explorer does, which is pretty close to what Chrome, Firefox, and the Standard do (and often matches).

https://dump.testsuite.org/encoding/form-encoding-special-euc-kr.html
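
The "HTML entities" behavior described above can be approximated outside a browser. The sketch below uses Python's built-in `euc_kr` codec with the `xmlcharrefreplace` error handler as a stand-in: it emits decimal numeric character references for unencodable code points, which is the same output shape as the form-submission fallback. Python's table is not the Encoding Standard's EUC-KR index, so treat this as an illustration of the error mode, not a conformance test; `encode_like_form` is a hypothetical helper name.

```python
def encode_like_form(text: str) -> bytes:
    """Encode to EUC-KR, replacing unmappable characters with &#NNNN;."""
    return text.encode("euc_kr", errors="xmlcharrefreplace")


# U+AC00 is in KS X 1001, so it encodes to two bytes.
print(encode_like_form("\uac00"))

# U+1100 (a Hangul conjoining jamo) has no mapping in this codec, so it
# comes out as a decimal numeric character reference.
print(encode_like_form("\u1100"))
```

A converter that carried the extra one-way mappings, as Safari is reported to, would instead emit a two-byte sequence for the second input.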

Comment 7 Jungshik Shin 2015-08-21 09:58:34 UTC

Chrome used to behave like Safari until I changed its EUC-KR to use the current encoding spec. So, the following is a bit circular. 

> Chrome and Firefox encode them as "HTML entities". The default error handling mode.

Anyway, it's not terribly important.