This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019. Please see the home page for more details.

Bug 28156 - Separate GBK and GB18030 even for decoding (toUnicode)
Status: RESOLVED WONTFIX
Alias: None
Product: WHATWG
Classification: Unclassified
Component: Encoding
Version: unspecified
Hardware: All All
Importance: P2 normal
Target Milestone: Unsorted
Assignee: Anne
QA Contact: sideshowbarker+encodingspec
 
Reported: 2015-03-06 18:48 UTC by Jungshik Shin
Modified: 2015-08-21 07:35 UTC

Description Jungshik Shin 2015-03-06 18:48:33 UTC
After bug 27235, GBK and GB18030 are distinct when encoding (fromUnicode). 

I guess the rationale for treating GBK and GB18030 identically when decoding (toUnicode) is that there is a (significant) number of pages that are actually in GB18030 but are mislabelled as GBK. 

I wonder if there are any statistics collected for that. I'm curious to know what percentage of documents labelled as GBK are actually in GB18030. My suspicion is that it's pretty low, especially compared with 'ISO-8859-1 vs windows-1252', 'EUC-KR vs windows-949' (because it's so prevalent that the spec's EUC-KR is actually windows-949, which I fully support), 'TIS 620 : ISO-8859-11 : windows-874', and so forth. 

I'm raising this issue because 1) Blink, Webkit, Firefox (and I guess, IE, too) have treated two encodings separately, and 2) Blink needs to add extra code to treat GBK/GB18030 as specified in the current spec. 

I believe that it's doable (I thought about how to do that yesterday), but I'm not convinced that it's worth the effort / extra code.
Comment 1 Anne 2015-03-12 11:50:29 UTC
I would have expected that treating them identically for decoding saves you a decoding table. Or would you reuse that anyway?

They're treated identically because gbk is effectively a subset and for the other encodings we've found that supersets leak. I think there might be some anecdotal evidence here too, but not sure.
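
For illustration only (not part of the original comment): a minimal sketch of the subset relationship, assuming Python's built-in "gbk" and "gb18030" codecs as stand-ins. Those codecs are not the Encoding spec's tables, and the helper name try_decode is hypothetical. A character both encodings cover yields the same bytes, while a GB18030 four-byte sequence is rejected by a strict GBK-only decoder.

```python
# Stand-ins: Python's built-in codecs, not the Encoding spec's decoders.
shared = "中"            # covered by both GBK and GB18030
astral = "\U00010000"    # outside the BMP; needs a GB18030 four-byte sequence

# Two-byte GBK sequences mean the same thing in GB18030.
assert shared.encode("gbk") == shared.encode("gb18030")

# GB18030 encodes the non-BMP character as four bytes.
four_byte = astral.encode("gb18030")


def try_decode(data, codec):
    """Return the decoded text, or None if the bytes are invalid for codec."""
    try:
        return data.decode(codec)
    except UnicodeDecodeError:
        return None


# A GB18030 decoder accepts everything a GBK decoder accepts, plus more.
assert try_decode(shared.encode("gbk"), "gb18030") == shared
assert try_decode(four_byte, "gb18030") == astral
# A strict GBK-only decoder fails on the four-byte sequence.
assert try_decode(four_byte, "gbk") is None
```

This is the sense in which content from the superset can "leak" into pages labelled with the subset: such pages only decode fully if the GBK label is handled by the GB18030 decoder.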
Comment 2 Henri Sivonen 2015-03-12 16:35:15 UTC
(In reply to Jungshik Shin from comment #0)
> I'm raising this issue because 1) Blink, Webkit, Firefox (and I guess, IE,
> too) have treated two encodings separately 

Firefox no longer does, so going back to the old state would involve extra work for us...

(In reply to Anne from comment #1)
> I would have expected that treating them identically for decoding saves you
> a decoding table. Or would you reuse that anyway?

FWIW, the old code in Firefox reused the reusable tables.
Comment 3 Jungshik Shin 2015-05-12 18:59:41 UTC
(In reply to Anne from comment #1)
> I would have expected that treating them identically for decoding saves you
> a decoding table. Or would you reuse that anyway?

It does not save us anything. Both tables (GBK and GB18030) would have to be shipped (unlike Mozilla, ICU does not have two separate tables for encoding and decoding). 

Actually, we need additional code in Blink [1] to treat encoding and decoding differently for GBK and GB18030 (identical for toUnicode, distinct for fromUnicode), which we'd like to avoid if possible. 
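
For illustration only (not Blink or ICU code): a rough sketch of the asymmetric handling described above, again assuming Python's built-in codecs as stand-ins. The table and function names are hypothetical, and the numeric-character-reference fallback for unencodable characters is an assumption made for this example.

```python
# Hypothetical label-to-codec tables; Python codecs stand in for real converters.
DECODER_FOR_LABEL = {
    "gbk": "gb18030",      # toUnicode: both labels share one decoder
    "gb18030": "gb18030",
}
ENCODER_FOR_LABEL = {
    "gbk": "gbk",          # fromUnicode: the labels stay distinct
    "gb18030": "gb18030",
}


def to_unicode(label, data):
    """Decode bytes; a GBK label is decoded with the GB18030 decoder."""
    return data.decode(DECODER_FOR_LABEL[label])


def from_unicode(label, text):
    """Encode text; GBK output never contains four-byte sequences.

    Unencodable characters become numeric character references
    (an assumption chosen for this sketch).
    """
    return text.encode(ENCODER_FOR_LABEL[label], errors="xmlcharrefreplace")


sample = "中\U00010000"
gb18030_bytes = from_unicode("gb18030", sample)
gbk_bytes = from_unicode("gbk", sample)

# Decoding is label-independent; encoding is not.
assert to_unicode("gbk", gb18030_bytes) == to_unicode("gb18030", gb18030_bytes)
assert gbk_bytes != gb18030_bytes
```

The extra code in question amounts to maintaining two label mappings, one per direction, instead of a single mapping shared by both.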


> They're treated identically because gbk is effectively a subset and for the
> other encodings we've found that supersets leak. I think there might be some
> anecdotal evidence here too, but not sure.

As I wrote in the previous comment, I suspect that the extent of "leak" (if any) is much smaller in gbk-gb18030 than other cases. 

[1] It might be possible to do this in ICU as well, but I don't want to make a patch to ICU (that is hard to upstream because I don't have a good justification).
Comment 4 Jungshik Shin 2015-05-12 20:57:08 UTC
Hmm.. What's required for ICU can be simpler than I thought (without actually looking) if we *only* care about the conversion and might be acceptable by the upstream. In that case, we can save almost 100kB (by dropping the GBK table entirely). A strawman CL for Chrome's ICU is at https://codereview.chromium.org/1141463003/. I haven't even compiled it, though.
Comment 5 Jungshik Shin 2015-05-29 22:00:50 UTC
(In reply to Jungshik Shin from comment #4)
> Hmm.. What's required for ICU can be simpler than I thought (without
> actually looking) if we *only* care about the conversion and might be
> acceptable by the upstream. In that case, we can save almost 100kB (by
> dropping the GBK table entirely). A strawman CL for Chrome's ICU is at
> https://codereview.chromium.org/1141463003/. I haven't even compiled it,
> though.

Well, it's not that simple.  It'd be better to do that in Blink. 

However, I think the spec had better be changed to separate GBK and GB18030 completely (in both directions). 

Unlike the other pairs or triplets that the spec made synonymous with each other, I don't believe that GBK and GB18030 were used interchangeably in the wild, partly because none of the widely used browsers treated them as synonyms until Firefox recently did so per the spec.
Comment 6 Anne 2015-08-19 12:20:25 UTC
It's not clear to me why having these as distinct for decoding is better.
Comment 7 Anne 2015-08-21 07:35:32 UTC
WONTFIX per comment 6.