This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019. Please see the home page for more details.
GB18030-2005 appears to map some 2-byte sequences to regular Unicode code points rather than to PUA code points in the BMP. For instance, GB18030-2000 (and the current encoding spec and ICU's gb18030) maps \xFE\x51 to U+E816. However, GB18030-2005 appears to map \xFE\x51 to U+20087. [1]

The glyph for U+E816 in Simsun on Windows 8 visually matches the code chart glyph for U+20087 ( http://www.fileformat.info/info/unicode/char/20087/index.htm ).

I don't know how to represent U+E816 in GB18030-2005, because there is no gap in the 4-byte sequences. The glibc implementation treats it as illegal, but it may not be supposed to do that. [2]

I propose that a note be added to the spec saying that it follows GB18030-2000 rather than GB18030-2005.

[1] I couldn't get hold of the GB18030-2005 spec, so I'm using glibc's iconv as a proxy:

    $ printf '\xfe\x51' | LC_ALL=C iconv -t UTF-32BE -f GB18030 | hexdump -C
    00000000 00 02 00 87

[2]
    $ printf '\xe8\x16' | LC_ALL=C iconv -f UTF-16BE -t GB18030 | hexdump -C
    iconv: illegal input sequence at position 0
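For readers following the two iconv probes above, here is a minimal Python sketch of what the hexdump bytes mean. It uses only Python's built-in UTF-16 and UTF-32 codecs and deliberately avoids any gb18030 codec, since which edition a given codec implements is exactly what is in question. The helper name is_bmp_pua is made up for illustration.

```python
# Hypothetical helper (not from any spec): the BMP Private Use Area
# spans U+E000..U+F8FF.
def is_bmp_pua(code_point: int) -> bool:
    return 0xE000 <= code_point <= 0xF8FF

# Bytes printed by the first iconv probe (UTF-32BE output for \xFE\x51).
glibc_result = bytes.fromhex("00020087").decode("utf-32-be")

# Bytes fed to the second iconv probe (UTF-16BE input).
pua_input = bytes.fromhex("e816").decode("utf-16-be")

for ch in (glibc_result, pua_input):
    print(f"U+{ord(ch):04X}", "PUA" if is_bmp_pua(ord(ch)) else "not PUA")
```

The first value is a supplementary-plane code point and the second sits in the BMP PUA, which is the contrast the comment above describes.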
Created attachment 1612 [details]
GB18030-2000 vs GB18030-2005 : PUA => regular

The attachment lists all the PUA code points for which Simsun (a font on Windows) has glyphs.

The first column is the GB18030 byte sequence (2-byte). The second is the GB18030-2000 Unicode mapping (PUA), and the third is the GB18030-2005 Unicode mapping (non-PUA), presuming that glibc's iconv is correct [1].

Simsun has glyphs for the PUA code points, but it does not cover the regular non-PUA code points (3rd column).

A newer Simplified Chinese font on Windows (Microsoft Yahei) does cover the non-PUA code points (3rd column), while it does not cover the PUA code points (2nd column).

[1] At least for U+FE10 .. U+FE19, it's very likely that it's correct. Those characters were added to Unicode 4.1 in March 2005 (see http://unicode.org/cldr/utility/list-unicodeset.jsp?a=\p{subhead=Glyphs%20for%20vertical%20variants} ).
Initially, I thought we'd better stick to GB18030-2000, but having checked the coverage of Windows fonts (as well as fonts on Android such as Noto Sans CJK), I'm having second thoughts. It's not clear which would be best.

The characters in the attachment are not likely to be used very often, so it may not matter much. Vertical variants are there only for GB18030 compatibility, and very few documents would use them explicitly (moreover, Simplified Chinese very rarely, if ever, uses vertical layout).

If it affects only an extremely small number of documents, it can be argued that the newer mapping (GB18030-2005) is better.
WebKit and Blink have these for GBK (but not gb18030 [1]):

    switch (character) {
    case 0x01F9: return 0xE7C8;
    case 0x1E3F: return 0xE7C7;
    case 0x22EF: return 0x2026;
    case 0x301C: return 0xFF5E;
    }

What the above code snippet does is add one-way mappings (fromUnicode only).

1. U+01F9 => xA8xBF
ICU's GBK (windows-936) has U+E7C8 <=> xA8xBF.
The encoding spec and ICU's gb18030 have U+01F9 <=> xA8xBF.
This one is easy. I'll change Chrome's GBK to use U+01F9 instead of U+E7C8 (PUA) for xA8xBF.

2. U+1E3F => xA8xBC
ICU's GBK has U+E7C7 <=> xA8xBC, while its gb18030 has U+1E3F <=> xA8xBC.
index-gb18030 also has the PUA mapping (U+E7C7) for xA8xBC.
U+1E3F has been in Unicode since 1.1.0. Anyway, this may be another case of GB18030-2000 vs GB18030-2005. I propose that the spec be changed to use U+1E3F for xA8xBC instead of U+E7C7 (PUA).

3. U+22EF => xA1xAD
All three (the spec, GBK and GB18030 in ICU) have U+2026 <=> xA1xAD.
U+2026 : Horizontal Ellipsis
U+22EF : Midline Horizontal Ellipsis

4. U+301C => xA1xAB
All three have U+FF5E <=> xA1xAB.
U+FF5E : Fullwidth Tilde
U+301C : Wave Dash

#3 and #4 should be dealt with separately, even if we want to consider them. My gut sense is that they are not that important. I guess WebKit did that because the old Mac converter uses U+301C and U+22EF instead of U+FF5E and U+2026.

As I wrote above, #1 is a Chromium issue. Only #2 is relevant here.

We can generalize this bug to decide what to do about PUA code points in GB18030 and GBK. IMHO, we'd better avoid mapping to PUA code points as much as possible. If there are regular encoded Unicode characters, we'd better use them instead. That is more or less in line with using the GB18030-2005 mapping instead of the 2000 one.

[1] Blink code link: https://code.google.com/p/chromium/codesearch#chromium/src/third_party/WebKit/Source/wtf/text/TextCodecICU.cpp&l=380
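The one-way (fromUnicode) substitutions in the snippet quoted in this comment can be sketched in Python as follows. This is only an illustration of the idea, not the WebKit/Blink implementation: the table values are copied from the snippet, while the names ENCODE_ONLY_FALLBACKS and apply_encode_fallbacks are hypothetical.

```python
# Encode-only substitutions, copied from the switch statement quoted above.
ENCODE_ONLY_FALLBACKS = {
    0x01F9: 0xE7C8,
    0x1E3F: 0xE7C7,
    0x22EF: 0x2026,
    0x301C: 0xFF5E,
}

def apply_encode_fallbacks(text: str) -> str:
    """Hypothetical helper: substitute code points before text reaches a GBK encoder."""
    return text.translate(ENCODE_ONLY_FALLBACKS)

# A decoder never consults this table, which is what makes the mapping one-way:
# U+301C is rewritten on the way out, but decoding xA1xAB still yields U+FF5E.
print([f"U+{ord(c):04X}" for c in apply_encode_fallbacks("\u01f9\u1e3f\u22ef\u301c")])
```

Characters that are not in the table pass through unchanged, so the substitution step can sit in front of any encoder.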
1) Are you not concerned with gb18030 no longer being a UTF?

2) Is your suggestion here as follows:

* Modify the mapping of xA8xBC.
* Indicate gb18030's mapping is GB18030-2000 with the exception of the mapping for xA8xBC?

Copying other implementers in case they have thoughts.
(In reply to Jungshik Shin from comment #1)
> Created attachment 1612 [details]
> GB18030-2000 vs GB18030-2005 : PUA =>regular
>
> The attachment lists all the PUA code points for which Simsun (font on
> Windows) have glyphs.
>
> The first column is GB18030 byte sequences (2-byte). The second is
> GB18030-2000 Unicode mapping (PUA) and the third is GB18030-2005 (presumably
> if glibc's iconv is correct [1] ) Unicode mapping (non-PUA).
>
> Simsun have glyphs for PUA code points, but it does not cover regular
> non-PUA code points (3rd column).
>
> A new Simplfiied Chinese font on Windows (Microsoft Yahei) does cover
> non-PUA code points (3rd column) while it does not cover PUA code points
> (2nd column).
>
> [1] At least for U+FE10 .. U+FE19, it's very likely that it's correct. Those
> characters were added to Unicode 4.1 in March 2005 (see
> http://unicode.org/cldr/utility/list-unicodeset.jsp?a=\p{subhead=Glyphs%20for%20vertical%20variants} ).

No, the GB 18030-2005 standard did NOT change those mappings. Glibc is wrong.
The only change between GB 18030-2005 and GB 18030-2000 is swapping a mapping for LATIN SMALL LETTER M WITH ACUTE.

Here is table E.2, taken and translated from the standard:
> GB 18030      -2005    -2000
> 0xA8BC        U+1E3F   U+E7C7
> 0x8135F437    U+E7C7   U+1E3F
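To make the swap in table E.2 concrete, here is a small Python sketch. The two-entry tables are copied from the table above. The linear-index formula for four-byte sequences is the one used by the Encoding Standard's gb18030 decoder as I understand it, and the names four_byte_pointer, E2_MAPPING and decode_e2 are hypothetical.

```python
def four_byte_pointer(b1: int, b2: int, b3: int, b4: int) -> int:
    """Linear index of a gb18030 four-byte sequence (assumed Encoding Standard formula)."""
    return (b1 - 0x81) * 12600 + (b2 - 0x30) * 1260 + (b3 - 0x81) * 10 + (b4 - 0x30)

# Table E.2: the two byte sequences swap code points between the editions.
E2_MAPPING = {
    "2000": {bytes.fromhex("A8BC"): 0xE7C7, bytes.fromhex("8135F437"): 0x1E3F},
    "2005": {bytes.fromhex("A8BC"): 0x1E3F, bytes.fromhex("8135F437"): 0xE7C7},
}

def decode_e2(sequence: bytes, edition: str) -> int:
    """Look up one of the two affected sequences under the given edition."""
    return E2_MAPPING[edition][sequence]

# Index of 0x8135F437 within the four-byte range.
print(four_byte_pointer(0x81, 0x35, 0xF4, 0x37))
for edition in ("2000", "2005"):
    print(edition, f"U+{decode_e2(bytes.fromhex('A8BC'), edition):04X}")
```

Under either edition both code points remain reachable, so the swap changes which byte sequence carries the PUA code point without removing any mapping.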
FYI, GB18030-2005 (and other mandatory Chinese standards) are freely available here (requires Adobe Reader and FileOpen plug-in): http://gb123.sac.gov.cn/gb/index
Thank you Masatoshi. It seems to me that Safari and Chrome implement 2005, whereas Internet Explorer and Firefox implement 2000. My inclination is to stick with 2000 and add a note to clarify that, but advice is appreciated.
Simple file that shows the difference in browsers for decoding the byte sequence from comment 5: https://dump.testsuite.org/encoding/gb18030-A8BC.html
https://github.com/whatwg/encoding/commit/257aa5b64f5ccae76b8ed20d87cc2895deb17f0a
(In reply to Anne from comment #7)
> Thank you Masatoshi. It seems to me that Safari/Chrome implement 2005.
> Whereas Internet Explorer and Firefox implement 2000. My inclination is to
> stick with 2000 and add a note to clarify that, but advice appreciated.

My preference is to use the 2005 standard (which differs from my stance in comment #0) and to avoid PUA code points as much as possible.
Jungshik, do you mean you want to make the swap mentioned at the end of comment 5?

> GB 18030      -2005    -2000
> 0xA8BC        U+1E3F   U+E7C7
> 0x8135F437    U+E7C7   U+1E3F
(In reply to Masatoshi Kimura from comment #6)
> FYI, GB18030-2005 (and other mandatory Chinese standards) are freely
> available here (requires Adobe Reader and FileOpen plug-in):
> http://gb123.sac.gov.cn/gb/index

Thank you for the pointer. No wonder I couldn't find it. Anyway, I downloaded the PDF, but none of the PDF viewers on my Mac can open it (for Chrome, I filed a bug at http://crbug.com/523425 ).

Can you print the PDF to another PDF and upload it somewhere (if that's possible)? Thanks
From <http://crbug.com/523425>:
> When I did that with Adobe Acroread, I was sent to a plug-in download page at
> http://plugin.fileopen.com/Default.aspx?type=Filter&name=FOPN_foweb&bhcp=1.
> I didn't try to install it.

You will have to install the plug-in to open the PDF. The PDF is protected by the FileOpen DRM (so printing the PDF would not be possible). It looks like the plug-in supports Mac [1], but I have no Mac device to try it on. I could open the PDF on Windows with the plug-in.

[1] http://plugin.fileopen.com/all.aspx
(In reply to Masatoshi Kimura from comment #13)
> From <http://crbug.com/523425>:
> > When I did that with Adobe Acroread, I was sent to a plug-in download page at
> > http://plugin.fileopen.com/Default.aspx?type=Filter&name=FOPN_foweb&bhcp=1.
> > I didn't try to install it.
>
> You will have to install the plug-in to open the PDF. The PDF is protected
> by the FileOpen DRM (so printing PDF would not be possible). Looks like the
> plug-in supports Mac [1], but I have no Mac device to try. I could open the
> PDF on Windows with the plug-in.
>
> [1] http://plugin.fileopen.com/all.aspx

Yeah, I realized that the FileOpen plug-in is for DRM (I read their FAQ after posting the previous comment).
Filed https://github.com/whatwg/encoding/issues/22 to continue the discussion.