This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019.

Bug 16692 - merge gbk and gb18030
Summary: merge gbk and gb18030
Status: RESOLVED WONTFIX
Alias: None
Product: WHATWG
Classification: Unclassified
Component: Encoding
Version: unspecified
Hardware: PC All
Importance: P2 normal
Target Milestone: Unsorted
Assignee: Anne
QA Contact: sideshowbarker+encodingspec
 
Reported: 2012-04-10 21:17 UTC by Anne
Modified: 2012-10-30 17:13 UTC
CC List: 2 users

Description Anne 2012-04-10 21:17:11 UTC
Philip suggests we remove the gb18030 flag and get on with it. I don't mind, but we'd need to a) decide on a name (either gbk or gb18030), b) see if implementors are willing, and probably c) somehow figure out if this is worth the simplification.
Comment 1 Michael[tm] Smith 2012-04-11 03:30:28 UTC
As far as the name goes, I think you'd want it to be gb18030, since that's the current standard and it's meant to replace/supersede gbk.
Comment 2 Philip Jägenstedt 2012-04-30 11:50:08 UTC
I played a little in https://gitorious.org/whatwg/big5/commits/gb

Out of 449292 URLs (cn-urls.txt) 400022 were successfully fetched. Running test-gb.py on those found these URLs using valid GB18030 triples:

http://www.career.cun.edu.cn/jyw/index.jsp
	Content-Type: text/html; charset=gbk
	meta: <META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=GBK" name="Keywords" content="就业网,就业中心"/>

http://www.f5.com.cn/press/20090803a.html
	Content-Type: text/html
	meta: <meta http-equiv="content-type" content="text/html;charset=gb2312" />

http://portal.bisu.edu.cn/portal/jwc
	Content-Type: text/html; charset=gb18030
	meta: <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />

http://www.zyol.gz.cn/wenzhang1.php?id=247233
	Content-Type: text/html
	meta: <meta http-equiv="Content-Type" content="text/html; charset=gb2312">

http://www.nicpbp.org.cn/CL0452/
	Content-Type: text/html
	meta: <META http-equiv=Content-Type content="text/html; charset=gb2312">

http://www.qhmc.edu.cn/index/news/4/html/qhmc83.htm
	Content-Type: text/html
	meta: <META content="text/html; charset=gb2312" http-equiv=Content-Type>

http://www.f5.com.cn/press/20081027.html
	Content-Type: text/html
	meta: <meta http-equiv="content-type" content="text/html;charset=gb2312" />

http://en.nefu.edu.cn/oc.php
	Content-Type: text/html
	meta: <meta http-equiv="Content-Type" content="text/html; charset=gb18030">
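
The test-gb.py script itself is not reproduced in this thread. As a rough sketch of the kind of scan involved, assuming the "triples" above refer to GB18030's four-byte sequences (GB18030 adds four-byte sequences on top of GBK's one- and two-byte ones), something like the following would flag them. The function name and the walking strategy are illustrative assumptions, not the actual script:

```python
def find_four_byte_sequences(data: bytes) -> list:
    """Return the valid GB18030 four-byte sequences found in data.

    Walks the bytes the way a GBK-style decoder would, so that a
    four-byte pattern is only reported when it starts on a character
    boundary. Hypothetical helper, not taken from test-gb.py.
    """
    found = []
    i = 0
    n = len(data)
    while i < n:
        b1 = data[i]
        # ASCII, 0x80 and 0xFF are treated as single bytes here.
        if b1 < 0x81 or b1 == 0xFF:
            i += 1
            continue
        # A lead byte followed by 0x30-0x39 can only start a four-byte
        # sequence; GBK trail bytes are always 0x40 or higher.
        if i + 1 < n and 0x30 <= data[i + 1] <= 0x39:
            chunk = data[i:i + 4]
            if (len(chunk) == 4
                    and 0x81 <= chunk[2] <= 0xFE
                    and 0x30 <= chunk[3] <= 0x39):
                try:
                    chunk.decode("gb18030")
                except UnicodeDecodeError:
                    pass  # right shape, but not an assigned sequence
                else:
                    found.append(chunk)
                    i += 4
                    continue
            i += 1
            continue
        # Otherwise assume an ordinary two-byte GBK character.
        i += 2
    return found


page = "就业网 ".encode("gbk") + "\u00a0".encode("gb18030") + "\u00ae".encode("gb18030")
hits = find_four_byte_sequences(page)
print(len(hits))
```

Pure GBK input yields an empty list, so any hit means the page contains bytes that only a GB18030 decoder can read.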
Comment 3 Philip Jägenstedt 2012-04-30 12:30:30 UTC
Analysis:

> http://www.career.cun.edu.cn/jyw/index.jsp
>     Content-Type: text/html; charset=gbk
>     meta: <META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=GBK"
> name="Keywords" content="就业网,就业中心"/>

Plenty of doubly misencoded nonsense like:

<TD style="FONT-SIZE: 12px; COLOR: #ff3300; FONT-FAMILY: Verdana, ËÎÌå" width=467 background=/jyw/images/cau_13.gif>

>>> 'ËÎÌå'.encode('latin1').decode('gbk')
'宋体'

> http://www.f5.com.cn/press/20090803a.html
>     Content-Type: text/html
>     meta: <meta http-equiv="content-type" content="text/html;charset=gb2312" />

® and ™

> http://portal.bisu.edu.cn/portal/jwc
>     Content-Type: text/html; charset=gb18030
>     meta: <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />

U+FEFF (the page is incorrectly labeled as UTF-8 in <meta>)

> http://www.zyol.gz.cn/wenzhang1.php?id=247233
>     Content-Type: text/html
>     meta: <meta http-equiv="Content-Type" content="text/html; charset=gb2312">

Severely misencoded stuff.

> http://www.nicpbp.org.cn/CL0452/
>     Content-Type: text/html
>     meta: <META http-equiv=Content-Type content="text/html; charset=gb2312">

Explicitly encoded U+FFFD, it seems.

> http://www.qhmc.edu.cn/index/news/4/html/qhmc83.htm
>     Content-Type: text/html
>     meta: <META content="text/html; charset=gb2312" http-equiv=Content-Type>

U+FEFF and © (in contexts that don't really make sense)

> http://www.f5.com.cn/press/20081027.html
>     Content-Type: text/html
>     meta: <meta http-equiv="content-type" content="text/html;charset=gb2312" />

® and ™

> http://en.nefu.edu.cn/oc.php
>     Content-Type: text/html
>     meta: <meta http-equiv="Content-Type" content="text/html; charset=gb18030">

U+00A0 encoded as a triple for no good reason, but correctly labeled so it doesn't matter.
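
For reference, Python's built-in gb18030 codec encodes U+00A0 as a four-byte sequence, which is presumably what is meant here. This is only an illustration of the bytes involved, not a claim about how the page was generated:

```python
nbsp = "\u00a0"  # NO-BREAK SPACE

# U+00A0 has no two-byte GBK form, so GB18030 falls back to four bytes.
encoded = nbsp.encode("gb18030")
print(encoded.hex(" "))  # 81 30 84 32
print(len(encoded))      # 4

# With a gb18030 label the bytes round-trip cleanly.
decoded = encoded.decode("gb18030")
print(decoded == nbsp)   # True
```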
Comment 4 Philip Jägenstedt 2012-04-30 12:38:34 UTC
Conclusion:

It doesn't really matter for decoding, as actual GB18030 content seems to be extremely rare. Of the pages not labeled as GB18030, only f5.com.cn is clearly intentional usage.

My assumption was that GB18030 content labeled as GBK would be common, but since it is not, merging them will likely mask a lot more decoding errors than it fixes.
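
The trade-off can be sketched by sorting byte strings into three buckets. This assumes Python's gbk and gb18030 codecs as stand-ins for browser decoders (which differ in details), and the bucket names are made up for illustration:

```python
def classify(data: bytes) -> str:
    """Bucket a byte string by which decoder accepts it without errors."""

    def decodes(codec: str) -> bool:
        try:
            data.decode(codec)
            return True
        except UnicodeDecodeError:
            return False

    if decodes("gbk"):
        return "gbk-clean"       # unaffected by a merge
    if decodes("gb18030"):
        return "needs-gb18030"   # errors today under gbk, silent after a merge
    return "broken-in-both"      # errors either way


print(classify("宋体".encode("gbk")))
print(classify("\u00ae".encode("gb18030")))
print(classify(b"\xff\xff"))
```

A merged decoder stops reporting errors for everything in the middle bucket, whether the bytes were intentional (as on f5.com.cn) or the result of misencoding.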
Comment 5 Anne 2012-10-11 15:10:54 UTC
Thanks Philip!