27235 – Bring back gbk encoder

This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019.

Status:	RESOLVED FIXED

Alias:	None

Product:	WHATWG
Classification:	Unclassified
Component:	Encoding
Version:	unspecified
Hardware:	PC All

Importance:	P2 normal
Target Milestone:	Unsorted
Assignee:	Anne
QA Contact:	sideshowbarker+encodingspec

Reported:	2014-11-04 19:43 UTC by Anne
Modified:	2015-03-06 18:51 UTC
CC List:	6 users

Description Anne 2014-11-04 19:43:19 UTC

Firefox ended up not following the plan from bug 16862 comment 18. Its gbk decoder is identical to its gb18030 decoder, but its gbk encoder per https://bugzilla.mozilla.org/show_bug.cgi?id=951691 is distinct.

So we should probably bring the gbk encoder back. When fixing this we should pay attention to the EURO sign and PUA code points. See

  https://bugzilla.mozilla.org/show_bug.cgi?id=951691#c16
  https://bugzilla.mozilla.org/show_bug.cgi?id=951691#c19

Having said that, if other browsers meanwhile converged on not having a distinct gbk encoder, perhaps Firefox should revisit its approach. Input welcome.

Comment 1 Joshua Bell 2014-11-04 20:59:05 UTC

Data point: Chromium has NOT aligned with the Encoding standard here.

Our tracking bug is http://crbug.com/339862

As usual, Jungshik has a lot more context than I do, but we were definitely hesitant about trying to make this change.

Comment 2 Anne 2014-11-08 10:22:47 UTC

Anticipated changes:

* Partially revert https://github.com/whatwg/encoding/commit/182ad9e607a7c6f0fa51d9dd6c638edaa5ec59fd to restore gb18030 as an independent encoding with a single label, and gbk as an independent encoding with nine labels.
* Map gbk's decoder to gb18030's decoder (no flags).
* Introduce a flag for gb18030's encoder that limits it to what gbk can output. (Still need to look into € and PUA.)
* Use that flag to define gbk's encoder.

(Per that commit we apparently historically defined gb18030 in terms of gbk, but that doesn't make much sense. So now we'll define gbk as a subset of gb18030.)
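
The anticipated changes above can be sketched roughly as follows. This is a minimal illustration, not the spec's algorithm: it assumes Python's built-in "gb18030" codec as a stand-in for the standard's index, and it approximates "what gbk can output" as "anything gb18030 encodes in one or two bytes". The names decode_gb, encode_gb, encode_gbk and gbk_flag are hypothetical. The sketch deliberately does not address € or PUA code points, which still need looking into.

```python
def decode_gb(data: bytes) -> str:
    """Shared decoder: gbk input goes through the gb18030 decoder, no flags."""
    return data.decode("gb18030")


def encode_gb(text: str, gbk_flag: bool = False) -> bytes:
    """Encode with gb18030; with gbk_flag set, refuse four-byte sequences.

    Assumption: the gbk subset is approximated as every code point that
    gb18030 encodes in at most two bytes. Euro sign and PUA handling are
    not modelled here.
    """
    out = bytearray()
    for index, char in enumerate(text):
        encoded = char.encode("gb18030")
        if gbk_flag and len(encoded) == 4:
            # Four-byte sequences exist only in gb18030, so gbk must reject them.
            raise UnicodeEncodeError(
                "gbk", text, index, index + 1, "outside the gbk subset"
            )
        out += encoded
    return bytes(out)


def encode_gbk(text: str) -> bytes:
    """gbk's encoder is gb18030's encoder with the flag set."""
    return encode_gb(text, gbk_flag=True)
```

With this shape, encode_gbk("中") and encode_gb("中") produce the same two bytes, while a code point such as U+1F600 encodes to four bytes under encode_gb and raises under encode_gbk.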

Comment 3 Anne 2014-11-08 19:52:51 UTC

https://github.com/whatwg/encoding/commit/c8838716fc6f575f50506e5b82f12c434b5be6bb

(It turns out that gbk supports the same PUA code points as far as I can tell.)

Comment 4 Jungshik Shin 2014-11-10 07:01:18 UTC

Sorry that I didn't get back here in a timely manner; I was away at internal and external conferences last week. Chromium was hesitant, but before the latest revision I had been considering merging gbk and gb18030 per the spec.

Moreover, the latest revision makes it a bit hard to implement GBK/GB18030 without touching ICU's gb18030 implementation, even though I agree with the approach: (1) decoding is identical for both encodings, and (2) gbk encoding is limited to 'the gbk subset'. I've only just read the latest revision, so this is just my first thought. There might be an easier way; I'll give it more thought.

Comment 5 Jungshik Shin 2015-03-06 18:51:45 UTC

I filed bug 28156 suggesting that GBK and GB18030 be completely separated even when decoding (toUnicode).