Bug 25396 - Incorrect mapping in index18030.txt
Summary: Incorrect mapping in index18030.txt
Status: RESOLVED INVALID
Alias: None
Product: WHATWG
Classification: Unclassified
Component: Encoding
Version: unspecified
Hardware: PC All
Importance: P2 normal
Target Milestone: Unsorted
Assignee: Anne
QA Contact: sideshowbarker+encodingspec
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2014-04-20 04:56 UTC by Alexander Shtuchkin
Modified: 2014-04-28 12:16 UTC

See Also:


Attachments

Description Alexander Shtuchkin 2014-04-20 04:56:43 UTC
The input sequence A3 A0 in GB18030 is decoded as U+E5E5 by iconv and ICU. For example:

> printf "\xA3\xA0" | iconv -f gb18030 -t utf-16le | hexdump
0000000 e5 e5

ICU table: http://source.icu-project.org/repos/icu/data/trunk/charset/data/xml/gb-18030-2000.xml

Using the algorithm given in http://encoding.spec.whatwg.org/#gb18030-encoder, 
A3 A0 results in pointer 6555, which is mapped to U+3000 IDEOGRAPHIC SPACE in index18030.txt.
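
The pointer arithmetic can be checked by hand. The following is a minimal sketch of the two-byte pointer formula as I understand it from the Encoding Standard's gb18030 decoder; the function name gb18030_two_byte_pointer is hypothetical, and the sketch only computes the pointer, it does not look anything up in index18030.txt.

```python
def gb18030_two_byte_pointer(lead: int, trail: int) -> int:
    """Return the index pointer for a two-byte GB18030 sequence.

    Assumption: this mirrors the two-byte branch of the gb18030 decoder
    in the WHATWG Encoding Standard; four-byte sequences are not handled.
    """
    if not 0x81 <= lead <= 0xFE:
        raise ValueError(f"invalid lead byte: {lead:#04x}")
    if not (0x40 <= trail <= 0x7E or 0x80 <= trail <= 0xFE):
        raise ValueError(f"invalid trail byte: {trail:#04x}")
    # Trail byte 0x7F is skipped, so bytes above it shift down by one.
    offset = 0x40 if trail < 0x7F else 0x41
    # Each lead byte covers 190 valid trail bytes.
    return (lead - 0x81) * 190 + (trail - offset)


print(gb18030_two_byte_pointer(0xA3, 0xA0))  # prints 6555
```

For A3 A0 this gives (0xA3 - 0x81) * 190 + (0xA0 - 0x41) = 34 * 190 + 95 = 6555, the pointer quoted above.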

I believe this mapping is incorrect and should be replaced with U+E5E5.
Comment 1 Martin Dürst 2014-04-21 08:43:01 UTC
For what it's worth, Ruby also produces U+E5E5:

prompt> ruby -e 'p "\xA3\xA0".encode("UTF-16BE", "GB18030")'
"\uE5E5"
Comment 2 Jungshik Shin 2014-04-21 22:59:53 UTC
I'm pretty sure I added a comment here (well, it was on my phone, and I may have forgotten to press the 'save changes' button).

Anyway, I think we'd better keep the current mapping as it is. Mapping to a PUA code point does not make much sense.

WebKit/Blink actually overrides the ICU mapping and maps 'xA3 xA0' to U+3000. See http://goo.gl/ocjnDR
Comment 3 Anne 2014-04-22 10:17:17 UTC
I should probably add a note about this in http://encoding.spec.whatwg.org/#indexes
Comment 4 Alexander Shtuchkin 2014-04-28 10:35:29 UTC
So, I guess it's just a matter of policy. Choosing WebKit as an authority makes a lot of sense to me. Thank you for the explanation!