This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019. Please see the home page for more details.
Philip suggests we remove the gb18030 flag and get on with it. I don't mind, but we'd need to a) decide on a name (either gbk or gb18030), b) see if implementors are willing, and probably c) somehow figure out if this is worth the simplification.
As for the name, I think you'd want it to be gb18030, since that's the current standard and it's meant to supersede gbk.
I played a little in https://gitorious.org/whatwg/big5/commits/gb

Out of 449292 URLs (cn-urls.txt) 400022 were successfully fetched. Running test-gb.py on those found these URLs using valid GB18030 triples:

http://www.career.cun.edu.cn/jyw/index.jsp
Content-Type: text/html; charset=gbk
meta: <META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=GBK" name="Keywords" content="就业网,就业中心"/>

http://www.f5.com.cn/press/20090803a.html
Content-Type: text/html
meta: <meta http-equiv="content-type" content="text/html;charset=gb2312" />

http://portal.bisu.edu.cn/portal/jwc
Content-Type: text/html; charset=gb18030
meta: <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />

http://www.zyol.gz.cn/wenzhang1.php?id=247233
Content-Type: text/html
meta: <meta http-equiv="Content-Type" content="text/html; charset=gb2312">

http://www.nicpbp.org.cn/CL0452/
Content-Type: text/html
meta: <META http-equiv=Content-Type content="text/html; charset=gb2312">

http://www.qhmc.edu.cn/index/news/4/html/qhmc83.htm
Content-Type: text/html
meta: <META content="text/html; charset=gb2312" http-equiv=Content-Type>

http://www.f5.com.cn/press/20081027.html
Content-Type: text/html
meta: <meta http-equiv="content-type" content="text/html;charset=gb2312" />

http://en.nefu.edu.cn/oc.php
Content-Type: text/html
meta: <meta http-equiv="Content-Type" content="text/html; charset=gb18030">
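For readers who want to try a similar scan, here is a minimal sketch of how one might look for GB18030 multi-byte sequences that fall outside GBK. It is not the test-gb.py script, which is not reproduced in this thread. The function name `find_four_byte_sequences` is hypothetical, and the check is purely structural, assuming the GB18030 four-byte form (lead byte 0x81-0xFE, digit byte 0x30-0x39, byte 0x81-0xFE, digit byte 0x30-0x39). It does not verify that each sequence maps to an assigned code point.

```python
from typing import List, Tuple


def find_four_byte_sequences(data: bytes) -> List[Tuple[int, bytes]]:
    """Return (offset, sequence) for each structurally valid four-byte sequence.

    Hypothetical helper: walks the byte stream the way a GBK/GB18030
    decoder would, so that two-byte characters are skipped as a unit.
    """
    found = []
    i = 0
    n = len(data)
    while i < n:
        b1 = data[i]
        # ASCII, 0x80 and 0xFF are never lead bytes of a multi-byte sequence.
        if b1 < 0x81 or b1 == 0xFF:
            i += 1
            continue
        if i + 1 < n:
            b2 = data[i + 1]
            if 0x30 <= b2 <= 0x39:
                # Possible four-byte sequence: check the third and fourth bytes.
                if (i + 3 < n
                        and 0x81 <= data[i + 2] <= 0xFE
                        and 0x30 <= data[i + 3] <= 0x39):
                    found.append((i, data[i:i + 4]))
                    i += 4
                    continue
            elif 0x40 <= b2 <= 0xFE and b2 != 0x7F:
                # Ordinary two-byte (GBK-range) character.
                i += 2
                continue
        # Malformed or truncated sequence: resynchronise on the next byte.
        i += 1
    return found


# U+00A0 is not in the two-byte range, so Python's gb18030 codec
# emits it as a four-byte sequence between the two hanzi.
sample = "中\u00a0文".encode("gb18030")
hits = find_four_byte_sequences(sample)
```

Because the check is structural, misencoded pages can still produce hits, which matches the need to inspect each reported URL by hand.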
Analysis:

> http://www.career.cun.edu.cn/jyw/index.jsp
> Content-Type: text/html; charset=gbk
> meta: <META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=GBK" name="Keywords" content="就业网,就业中心"/>

Plenty of doubly misencoded nonsense like:

<TD style="FONT-SIZE: 12px; COLOR: #ff3300; FONT-FAMILY: Verdana, ËÎÌå" width=467 background=/jyw/images/cau_13.gif>

    >>> 'ËÎÌå'.encode('latin1').decode('gbk')
    '宋体'

> http://www.f5.com.cn/press/20090803a.html
> Content-Type: text/html
> meta: <meta http-equiv="content-type" content="text/html;charset=gb2312" />

® and ™

> http://portal.bisu.edu.cn/portal/jwc
> Content-Type: text/html; charset=gb18030
> meta: <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />

U+FEFF (Incorrectly labeled as UTF-8 in <meta>)

> http://www.zyol.gz.cn/wenzhang1.php?id=247233
> Content-Type: text/html
> meta: <meta http-equiv="Content-Type" content="text/html; charset=gb2312">

Severely misencoded stuff.

> http://www.nicpbp.org.cn/CL0452/
> Content-Type: text/html
> meta: <META http-equiv=Content-Type content="text/html; charset=gb2312">

Explicitly encoded U+FFFD, it seems.

> http://www.qhmc.edu.cn/index/news/4/html/qhmc83.htm
> Content-Type: text/html
> meta: <META content="text/html; charset=gb2312" http-equiv=Content-Type>

U+FEFF and © (in contexts that don't really make sense)

> http://www.f5.com.cn/press/20081027.html
> Content-Type: text/html
> meta: <meta http-equiv="content-type" content="text/html;charset=gb2312" />

® and ™

> http://en.nefu.edu.cn/oc.php
> Content-Type: text/html
> meta: <meta http-equiv="Content-Type" content="text/html; charset=gb18030">

U+00A0 encoded as a triple for no good reason, but correctly labeled so it doesn't matter.
Conclusion: It doesn't really matter for decoding, as actual GB18030 content seems to be extremely rare. Of the pages not labeled as GB18030, only f5.com.cn is clearly intentional usage. My assumption was that GB18030 content labeled as GBK would be common, but since it is not, merging them will likely mask a lot more decoding errors than it fixes.
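To make the "masking" concern concrete, here is a small sketch that decodes the same bytes under both labels. It uses Python's built-in `gbk` and `gb18030` codecs as stand-ins for browser decoders, which is an assumption: they are not the decoders under discussion and may differ in detail. The helper name `compare_decoders` is hypothetical.

```python
from typing import Tuple


def compare_decoders(data: bytes) -> Tuple[str, str]:
    """Decode data as gb18030 and as gbk, replacing undecodable bytes.

    Hypothetical helper; Python's codecs are only an approximation
    of what a browser would do with the same label.
    """
    as_gb18030 = data.decode("gb18030", errors="replace")
    as_gbk = data.decode("gbk", errors="replace")
    return as_gb18030, as_gbk


# A sequence outside GBK: U+00A0 as encoded by the gb18030 codec.
four_byte = "\u00a0".encode("gb18030")
wide_result, narrow_result = compare_decoders(four_byte)

# Ordinary two-byte content decodes identically under both labels.
two_byte = "宋体".encode("gbk")
same_wide, same_narrow = compare_decoders(two_byte)
```

Under the `gbk` codec the longer sequence surfaces as a visible decoding error (U+FFFD), while under `gb18030` it decodes silently. For a misencoded page labeled gbk, a merged decoder would therefore hide the error rather than expose it.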
Thanks Philip!