Bugzilla – Bug 25396
Incorrect mapping in index18030.txt
Last modified: 2014-04-28 12:16:57 UTC
The GB18030 input sequence A3 A0 is decoded as U+E5E5 by iconv and ICU. For example:
> printf "\xA3\xA0" | iconv -f gb18030 -t utf-16le | hexdump
0000000 e5 e5
ICU table: http://source.icu-project.org/repos/icu/data/trunk/charset/data/xml/gb-18030-2000.xml
Using the algorithm given in http://encoding.spec.whatwg.org/#gb18030-encoder,
A3 A0 results in pointer 6555, which is mapped to U+3000 IDEOGRAPHIC SPACE in index18030.txt.
I believe this mapping is incorrect and should be replaced with U+E5E5.
For what it's worth, Ruby also produces U+E5E5:
prompt> ruby -e 'p "\xA3\xA0".encode("UTF-16BE", "GB18030")'
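The pointer value 6555 quoted above can be checked by hand. Here is a minimal sketch in Python, assuming the two-byte pointer formula as I understand it from the Encoding Standard (190 trail bytes per lead byte, with 0x7F skipped); the function name is made up for illustration and is not part of any library.

```python
def gb18030_two_byte_pointer(lead: int, trail: int) -> int:
    """Compute the index pointer for a two-byte GB18030 sequence.

    Hypothetical helper: assumes lead is in 0x81..0xFE and trail is in
    0x40..0x7E or 0x80..0xFE, as in the two-byte range of GB18030.
    """
    if not 0x81 <= lead <= 0xFE:
        raise ValueError(f"lead byte out of range: {lead:#04x}")
    if not (0x40 <= trail <= 0x7E or 0x80 <= trail <= 0xFE):
        raise ValueError(f"trail byte out of range: {trail:#04x}")
    # Trail bytes below 0x7F start at 0x40; those above skip the unused 0x7F.
    offset = 0x40 if trail < 0x7F else 0x41
    return (lead - 0x81) * 190 + (trail - offset)


if __name__ == "__main__":
    # (0xA3 - 0x81) * 190 + (0xA0 - 0x41) = 34 * 190 + 95
    print(gb18030_two_byte_pointer(0xA3, 0xA0))
```

This only reproduces the arithmetic; which code point the pointer maps to is decided by the index file, which is the point in dispute here.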
I'm pretty sure I added a comment here (well, it was on my phone, and I may have forgotten to press the 'save changes' button).
Anyway, I think we'd better keep the current mapping as it is. Mapping to a PUA code point does not make much sense.
WebKit/Blink actually override the ICU mapping and map 'xA3 xA0' to U+3000. See http://goo.gl/ocjnDR
I should probably add a note about this in http://encoding.spec.whatwg.org/#indexes
So, I guess it's just a matter of policy. Choosing WebKit as the authority makes a lot of sense to me. Thank you for the explanation!