This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019. Please see the home page for more details.

Bug 28740 - GB18030-2000 vs GB18030-2005: decide on mapping for 0xA8BC
Status: RESOLVED FIXED
Alias: None
Product: WHATWG
Classification: Unclassified
Component: Encoding
Version: unspecified
Hardware: PC Linux
Importance: P2 normal
Target Milestone: Unsorted
Assignee: Anne
QA Contact: sideshowbarker+encodingspec
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2015-06-02 22:03 UTC by Jungshik Shin
Modified: 2015-12-10 21:34 UTC

See Also:


Attachments
GB18030-2000 vs GB18030-2005 : PUA =>regular (576 bytes, text/plain)
2015-06-02 22:49 UTC, Jungshik Shin

Description Jungshik Shin 2015-06-02 22:03:02 UTC
GB18030-2005 appears to map some 2-byte sequences to regular Unicode code points as opposed to PUA code points in BMP. 

For instance, GB18030-2000 (and the current encoding spec and ICU's gb18030) maps \xFE\x51 to U+E816. However, GB18030-2005 appears to map \xFE\x51 to U+20087. [1]

The glyph for U+E816 in Simsun on Windows 8 visually matches the code chart glyph for U+20087 ( http://www.fileformat.info/info/unicode/char/20087/index.htm ).


I don't know how to represent U+E816 in GB18030-2005, because there's no gap in the 4-byte sequences. The glibc implementation treats it as illegal, though that may not be the intended behavior. [2]


I propose that a note be added to the spec that it's GB18030-2000 instead of GB18030-2005. 

[1] 
I couldn't get hold of GB18030-2005 spec and I'm using glibc's iconv as a proxy:

$ printf '\xfe\x51' | LC_ALL=C iconv -t UTF-32BE -f GB18030 | hexdump -C
00000000  00 02 00 87                                     

[2] 
$ printf '\xe8\x16' | LC_ALL=C iconv -f UTF-16BE -t GB18030 | hexdump -C
iconv: illegal input sequence at position 0
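
The byte strings in the two commands above are simply the code points written in big-endian form. A minimal Python sketch (variable names are illustrative, not from any tool) that reproduces the input of [2] and the hexdump of [1] without relying on any GB18030 codec:

```python
# U+E816 serialized as UTF-16BE: the two bytes piped into iconv in [2].
pua_utf16be = "\ue816".encode("utf-16-be")

# U+20087 serialized as UTF-32BE: the four bytes shown by hexdump in [1].
supplementary_utf32be = "\U00020087".encode("utf-32-be")

print(pua_utf16be.hex(" "))            # e8 16
print(supplementary_utf32be.hex(" "))  # 00 02 00 87
```

This only checks the Unicode side of the two commands; it says nothing about which GB18030 mapping is correct.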
Comment 1 Jungshik Shin 2015-06-02 22:49:23 UTC
Created attachment 1612
GB18030-2000 vs GB18030-2005 : PUA =>regular

The attachment lists all the PUA code points for which Simsun (a font on Windows) has glyphs.

The first column is the GB18030 byte sequence (2-byte). The second is the GB18030-2000 Unicode mapping (PUA), and the third is the GB18030-2005 Unicode mapping (non-PUA), presuming glibc's iconv is correct [1].

Simsun has glyphs for the PUA code points, but it does not cover the regular non-PUA code points (3rd column).

A newer Simplified Chinese font on Windows (Microsoft Yahei) does cover the non-PUA code points (3rd column), while it does not cover the PUA code points (2nd column).


[1] At least for U+FE10 .. U+FE19, it's very likely that it's correct. Those characters were added to Unicode 4.1 in March 2005 (see http://unicode.org/cldr/utility/list-unicodeset.jsp?a=\p{subhead=Glyphs%20for%20vertical%20variants} ).
Comment 2 Jungshik Shin 2015-06-02 22:58:55 UTC
Initially, I thought we'd better stick to GB18030-2000, but having checked the coverage of Windows fonts (as well as fonts on Android such as Noto Sans CJK), I'm having second thoughts. It's not clear which would be best.

The characters in the attachment are not likely to be used very often, so it may not matter much. The vertical variants are there only for GB18030 compatibility, and very few documents would use them explicitly (moreover, Simplified Chinese very rarely, if ever, uses vertical layout).

If it affects only an extremely small number of documents, it can be argued that the newer mapping (GB18030-2005) is better.
Comment 3 Jungshik Shin 2015-06-03 20:44:50 UTC
WebKit and Blink have these for GBK (but not gb18030 [1]).

    switch (character) {
    case 0x01F9:
        return 0xE7C8;
    case 0x1E3F:
        return 0xE7C7;
    case 0x22EF:
        return 0x2026;
    case 0x301C:
        return 0xFF5E;
    }

What the above code snippet does is add one-way mappings (fromUnicode only):

1. U+01F9 => xA8xBF    
   ICU's GBK (windows-936) has U+E7C8 <=> xA8xBF
   The encoding spec and ICU's gb18030 have U+01F9 <=> xA8xBF

  This one is easy. I'll change Chrome's GBK to use U+01F9 instead of U+E7C8 (PUA) for xA8xBF. 
   
2. U+1E3F => xA8xBC     

  ICU's GBK has U+E7C7 <=> xA8xBC while its gb18030 has U+1E3F <=> xA8xBC

  index-gb18030 also has PUA mapping ( U+E7C7) for xA8xBC. 
  U+1E3F has been in Unicode since 1.1.0.

  Anyway, this may be another case of GB18030-2000 vs GB18030-2005.

  And I propose that the spec be changed to use U+1E3F for xA8xBC instead of U+E7C7 (PUA).

3. U+22EF => xA1xAD

   All three (the spec, GBK and GB18030 in ICU) have U+2026 <=> xA1xAD.

   U+2026 : Horizontal Ellipsis 
   U+22EF : Midline horizontal ellipsis

4. U+301C => xA1xAB

   All three have U+FF5E <=> xA1xAB

   U+FF5E : full-width tilde
   U+301C : wave dash

#3 and #4 should be dealt with separately, even if we want to consider them. My gut sense is that they're not that important. I guess WebKit did that because the old Mac converter uses U+301C and U+22EF instead of U+FF5E and U+2026.
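
To make the "one-way" point concrete, here is a rough Python sketch. The substitution table is copied from the switch statement above, and the two-way table holds only the four ICU GBK entries cited in this comment; the function and table names are made up for illustration and are not WebKit's or ICU's API.

```python
# Encoder-side substitutions, copied from the switch statement above.
ENCODE_SUBSTITUTIONS = {
    0x01F9: 0xE7C8,
    0x1E3F: 0xE7C7,
    0x22EF: 0x2026,
    0x301C: 0xFF5E,
}

# Two-way ICU GBK entries mentioned in this comment (bytes <=> code point).
ICU_GBK = {
    b"\xa8\xbf": 0xE7C8,
    b"\xa8\xbc": 0xE7C7,
    b"\xa1\xad": 0x2026,
    b"\xa1\xab": 0xFF5E,
}
ICU_GBK_REVERSE = {cp: seq for seq, cp in ICU_GBK.items()}


def encode_code_point(code_point: int) -> bytes:
    """Apply the substitution first, then look up the two-way table."""
    code_point = ENCODE_SUBSTITUTIONS.get(code_point, code_point)
    return ICU_GBK_REVERSE[code_point]


def decode_sequence(seq: bytes) -> int:
    """Decoding uses only the two-way table; substitutions are not undone."""
    return ICU_GBK[seq]


encoded = encode_code_point(0x1E3F)
print(encoded.hex(), hex(decode_sequence(encoded)))
```

Encoding U+1E3F yields xA8xBC, but decoding xA8xBC gives back U+E7C7, not U+1E3F, which is why the mapping is one-way.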


As I wrote above, #1 is a Chromium issue. 

Only #2 is relevant here. 

We can generalize this bug to decide what to do about PUA code points in GB18030 and GBK. 

IMHO, we'd better avoid mapping to PUA code points as much as possible. Where regular encoded Unicode characters exist, we should use them instead. That is more or less in line with using the GB18030-2005 mapping instead of the 2000 one.

[1] 
Blink code link: https://code.google.com/p/chromium/codesearch#chromium/src/third_party/WebKit/Source/wtf/text/TextCodecICU.cpp&l=380
Comment 4 Anne 2015-08-19 11:26:39 UTC
1) Are you not concerned with gb18030 no longer being a UTF?

2) Is your suggestion here as follows:

* Modify the mapping of xA8xBC.
* Indicate gb18030's mapping is GB18030-2000 with the exception of the mapping for xA8xBC?

Copying other implementers in case they have thoughts.
Comment 5 Masatoshi Kimura 2015-08-19 14:51:17 UTC
(In reply to Jungshik Shin from comment #1)
> Created attachment 1612 [details]
> GB18030-2000 vs GB18030-2005 : PUA =>regular
> 
> The attachment lists all the PUA code points for which Simsun (a font on
> Windows) has glyphs.
> 
> The first column is GB18030 byte sequences (2-byte). The second is
> GB18030-2000 Unicode mapping (PUA) and the third is GB18030-2005 (presumably
> if glibc's iconv is correct [1] ) Unicode mapping (non-PUA). 
> 
> Simsun has glyphs for PUA code points, but it does not cover regular
> non-PUA code points (3rd column). 
> 
> A new Simplified Chinese font on Windows (Microsoft Yahei) does cover
> non-PUA code points (3rd column) while it does not cover PUA code points
> (2nd column). 
> 
> 
> [1] At least for U+FE10 .. U+FE19, it's very likely that it's correct. Those
> characters were added to Unicode 4.1 in March 2005 (see
> http://unicode.org/cldr/utility/list-unicodeset.
> jsp?a=\p{subhead=Glyphs%20for%20vertical%20variants} ).

No, the GB 18030-2005 standard did NOT change those mappings. Glibc is wrong. The only change between GB 18030-2005 and GB 18030-2000 is swapping a mapping for LATIN SMALL LETTER M WITH ACUTE. Here is the table E.2 taken and translated from the standard:
> GB 18030   -2005  -2000
> 0xA8BC     U+1E3F U+E7C7
> 0x8135F437 U+E7C7 U+1E3F
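
For implementers, the table above amounts to a two-entry swap. A small Python sketch, using only the four values from table E.2 (the dictionary and function names are hypothetical):

```python
# Byte sequence -> code point, per edition, taken from table E.2 above.
GB18030_2000 = {
    b"\xa8\xbc": 0xE7C7,
    b"\x81\x35\xf4\x37": 0x1E3F,
}
GB18030_2005 = {
    b"\xa8\xbc": 0x1E3F,
    b"\x81\x35\xf4\x37": 0xE7C7,
}


def differing_sequences(old: dict, new: dict) -> list:
    """Return the byte sequences whose code point differs between editions."""
    return sorted(seq for seq in old if old[seq] != new[seq])


for seq in differing_sequences(GB18030_2000, GB18030_2005):
    print(seq.hex(), f"U+{GB18030_2000[seq]:04X} -> U+{GB18030_2005[seq]:04X}")

# Both editions cover the same two code points; only the assignment is swapped.
same_repertoire = set(GB18030_2000.values()) == set(GB18030_2005.values())
```

Because it is a swap, both U+1E3F and U+E7C7 stay representable in either edition.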
Comment 6 Masatoshi Kimura 2015-08-19 15:56:22 UTC
FYI, GB18030-2005 (and other mandatory Chinese standards) are freely available here (requires Adobe Reader and FileOpen plug-in):
http://gb123.sac.gov.cn/gb/index
Comment 7 Anne 2015-08-19 16:33:43 UTC
Thank you Masatoshi. It seems to me that Safari/Chrome implement 2005. Whereas Internet Explorer and Firefox implement 2000. My inclination is to stick with 2000 and add a note to clarify that, but advice appreciated.
Comment 8 Anne 2015-08-19 16:34:44 UTC
Simple file that shows the difference in browsers for decoding the byte sequence from comment 5: https://dump.testsuite.org/encoding/gb18030-A8BC.html
Comment 10 Jungshik Shin 2015-08-21 17:18:00 UTC
(In reply to Anne from comment #7)
> Thank you Masatoshi. It seems to me that Safari/Chrome implement 2005.
> Whereas Internet Explorer and Firefox implement 2000. My inclination is to
> stick with 2000 and add a note to clarify that, but advice appreciated.

My preference is to use the 2005 standard (this differs from my stance in comment #0) and avoid PUA code points as much as possible.
Comment 11 Anne 2015-08-21 17:23:05 UTC
Jungshik, do you mean you want to make the swap mentioned at the end of comment 5?

> GB 18030   -2005  -2000
> 0xA8BC     U+1E3F U+E7C7
> 0x8135F437 U+E7C7 U+1E3F
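
As a side note for anyone implementing the swap, the four-byte sequence in the table can be turned into a linear index with the usual GB18030 four-byte arithmetic (first and third bytes in 0x81..0xFE, second and fourth in 0x30..0x39). A sketch in Python, with a hypothetical function name:

```python
def four_byte_pointer(seq: bytes) -> int:
    """Linear index of a GB18030 four-byte sequence (0 for 81 30 81 30)."""
    if len(seq) != 4:
        raise ValueError("expected exactly four bytes")
    b1, b2, b3, b4 = seq
    if not (0x81 <= b1 <= 0xFE and 0x30 <= b2 <= 0x39
            and 0x81 <= b3 <= 0xFE and 0x30 <= b4 <= 0x39):
        raise ValueError("not a well-formed four-byte sequence")
    # 10 values for byte 4, 126 for byte 3, 10 for byte 2.
    return ((b1 - 0x81) * (10 * 126 * 10)
            + (b2 - 0x30) * (10 * 126)
            + (b3 - 0x81) * 10
            + (b4 - 0x30))


print(four_byte_pointer(bytes.fromhex("8135F437")))  # 7457
```

So the swap touches exactly one two-byte entry (0xA8BC) and one four-byte index (7457).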
Comment 12 Jungshik Shin 2015-08-21 17:42:27 UTC
(In reply to Masatoshi Kimura from comment #6)
> FYI, GB18030-2005 (and other mandatory Chinese standards) are freely
> available here (requires Adobe Reader and FileOpen plug-in):
> http://gb123.sac.gov.cn/gb/index

Thank you for the pointer. No wonder I couldn't find it. 

Anyway, I downloaded the PDF, but none of the PDF viewers on my Mac can open it.
(for Chrome, I filed a bug at http://crbug.com/523425 ). 

Could you print the PDF to another PDF and upload it somewhere (if that's possible)? Thanks.
Comment 13 Masatoshi Kimura 2015-08-21 21:18:41 UTC
From <http://crbug.com/523425>:
> When I did that with Adobe Acroread, I was sent to a plug-in download page at
> http://plugin.fileopen.com/Default.aspx?type=Filter&name=FOPN_foweb&bhcp=1.
> I didn't try to install it.

You will have to install the plug-in to open the PDF. The PDF is protected by the FileOpen DRM (so printing PDF would not be possible). Looks like the plug-in supports Mac [1], but I have no Mac device to try. I could open the PDF on Windows with the plug-in.

[1] http://plugin.fileopen.com/all.aspx
Comment 14 Jungshik Shin 2015-08-24 04:51:06 UTC
(In reply to Masatoshi Kimura from comment #13)
> From <http://crbug.com/523425>:
> > When I did that with Adobe Acroread, I was sent to a plug-in download page at
> > http://plugin.fileopen.com/Default.aspx?type=Filter&name=FOPN_foweb&bhcp=1.
> > I didn't try to install it.
> 
> You will have to install the plug-in to open the PDF. The PDF is protected
> by the FileOpen DRM (so printing PDF would not be possible). Looks like the
> plug-in supports Mac [1], but I have no Mac device to try. I could open the
> PDF on Windows with the plug-in.
> 
> [1] http://plugin.fileopen.com/all.aspx

Yeah, I realized that the FileOpen plug-in is for DRM (I read their FAQ after posting the previous comment).
Comment 15 Jungshik Shin 2015-12-10 21:34:25 UTC
Filed https://github.com/whatwg/encoding/issues/22 to continue the discussion.