<?xml version="1.0" encoding="UTF-8" standalone="yes" ?>
<!DOCTYPE bugzilla SYSTEM "https://www.w3.org/Bugs/Public/page.cgi?id=bugzilla.dtd">

<bugzilla version="5.0.4"
          urlbase="https://www.w3.org/Bugs/Public/"
          
          maintainer="sysbot+bugzilla@w3.org"
>

    <bug>
          <bug_id>16692</bug_id>
          
          <creation_ts>2012-04-10 21:17:11 +0000</creation_ts>
          <short_desc>merge gbk and gb18030</short_desc>
          <delta_ts>2012-10-30 17:13:13 +0000</delta_ts>
          <reporter_accessible>1</reporter_accessible>
          <cclist_accessible>1</cclist_accessible>
          <classification_id>1</classification_id>
          <classification>Unclassified</classification>
          <product>WHATWG</product>
          <component>Encoding</component>
          <version>unspecified</version>
          <rep_platform>PC</rep_platform>
          <op_sys>All</op_sys>
          <bug_status>RESOLVED</bug_status>
          <resolution>WONTFIX</resolution>
          
          
          <bug_file_loc></bug_file_loc>
          <status_whiteboard></status_whiteboard>
          <keywords></keywords>
          <priority>P2</priority>
          <bug_severity>normal</bug_severity>
          <target_milestone>Unsorted</target_milestone>
          
          
          <everconfirmed>1</everconfirmed>
          <reporter name="Anne">annevk</reporter>
          <assigned_to name="Anne">annevk</assigned_to>
          <cc>mike</cc>
    
    <cc>philipj</cc>
          
          <qa_contact>sideshowbarker+encodingspec</qa_contact>

      

      

      

          <comment_sort_order>oldest_to_newest</comment_sort_order>  
          <long_desc isprivate="0" >
    <commentid>66592</commentid>
    <comment_count>0</comment_count>
    <who name="Anne">annevk</who>
    <bug_when>2012-04-10 21:17:11 +0000</bug_when>
    <thetext>Philip suggests we remove the gb18030 flag and get on with it. I don&apos;t mind, but we&apos;d need to a) decide on a name (either gbk or gb18030), b) see if implementors are willing, and probably c) figure out whether the simplification is worth it.</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>66602</commentid>
    <comment_count>1</comment_count>
    <who name="Michael[tm] Smith">mike</who>
    <bug_when>2012-04-11 03:30:28 +0000</bug_when>
    <thetext>As for the name, I think you&apos;d want it to be gb18030, since that&apos;s the current standard and it&apos;s meant to supersede gbk.</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>67184</commentid>
    <comment_count>2</comment_count>
    <who name="Philip Jägenstedt">philipj</who>
    <bug_when>2012-04-30 11:50:08 +0000</bug_when>
    <thetext>I played a little in https://gitorious.org/whatwg/big5/commits/gb

Out of 449292 URLs (cn-urls.txt), 400022 were successfully fetched. Running test-gb.py on those found the following URLs using valid GB18030 triples:

http://www.career.cun.edu.cn/jyw/index.jsp
	Content-Type: text/html; charset=gbk
	meta: &lt;META HTTP-EQUIV=&quot;Content-Type&quot; CONTENT=&quot;text/html; charset=GBK&quot; name=&quot;Keywords&quot; content=&quot;就业网,就业中心&quot;/&gt;

http://www.f5.com.cn/press/20090803a.html
	Content-Type: text/html
	meta: &lt;meta http-equiv=&quot;content-type&quot; content=&quot;text/html;charset=gb2312&quot; /&gt;

http://portal.bisu.edu.cn/portal/jwc
	Content-Type: text/html; charset=gb18030
	meta: &lt;meta http-equiv=&quot;Content-Type&quot; content=&quot;text/html; charset=UTF-8&quot; /&gt;

http://www.zyol.gz.cn/wenzhang1.php?id=247233
	Content-Type: text/html
	meta: &lt;meta http-equiv=&quot;Content-Type&quot; content=&quot;text/html; charset=gb2312&quot;&gt;

http://www.nicpbp.org.cn/CL0452/
	Content-Type: text/html
	meta: &lt;META http-equiv=Content-Type content=&quot;text/html; charset=gb2312&quot;&gt;

http://www.qhmc.edu.cn/index/news/4/html/qhmc83.htm
	Content-Type: text/html
	meta: &lt;META content=&quot;text/html; charset=gb2312&quot; http-equiv=Content-Type&gt;

http://www.f5.com.cn/press/20081027.html
	Content-Type: text/html
	meta: &lt;meta http-equiv=&quot;content-type&quot; content=&quot;text/html;charset=gb2312&quot; /&gt;

http://en.nefu.edu.cn/oc.php
	Content-Type: text/html
	meta: &lt;meta http-equiv=&quot;Content-Type&quot; content=&quot;text/html; charset=gb18030&quot;&gt;</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>67185</commentid>
    <comment_count>3</comment_count>
    <who name="Philip Jägenstedt">philipj</who>
    <bug_when>2012-04-30 12:30:30 +0000</bug_when>
    <thetext>Analysis:

&gt; http://www.career.cun.edu.cn/jyw/index.jsp
&gt;     Content-Type: text/html; charset=gbk
&gt;     meta: &lt;META HTTP-EQUIV=&quot;Content-Type&quot; CONTENT=&quot;text/html; charset=GBK&quot; name=&quot;Keywords&quot; content=&quot;就业网,就业中心&quot;/&gt;

Plenty of doubly misencoded nonsense like:

&lt;TD style=&quot;FONT-SIZE: 12px; COLOR: #ff3300; FONT-FAMILY: Verdana, ËÎÌå&quot; width=467 background=/jyw/images/cau_13.gif&gt;

&gt;&gt;&gt; &apos;ËÎÌå&apos;.encode(&apos;latin1&apos;).decode(&apos;gbk&apos;)
&apos;宋体&apos;

&gt; http://www.f5.com.cn/press/20090803a.html
&gt;     Content-Type: text/html
&gt;     meta: &lt;meta http-equiv=&quot;content-type&quot; content=&quot;text/html;charset=gb2312&quot; /&gt;

® and ™

&gt; http://portal.bisu.edu.cn/portal/jwc
&gt;     Content-Type: text/html; charset=gb18030
&gt;     meta: &lt;meta http-equiv=&quot;Content-Type&quot; content=&quot;text/html; charset=UTF-8&quot; /&gt;

U+FEFF (Incorrectly labeled as UTF-8 in &lt;meta&gt;)

&gt; http://www.zyol.gz.cn/wenzhang1.php?id=247233
&gt;     Content-Type: text/html
&gt;     meta: &lt;meta http-equiv=&quot;Content-Type&quot; content=&quot;text/html; charset=gb2312&quot;&gt;

Severely misencoded stuff.

&gt; http://www.nicpbp.org.cn/CL0452/
&gt;     Content-Type: text/html
&gt;     meta: &lt;META http-equiv=Content-Type content=&quot;text/html; charset=gb2312&quot;&gt;

Explicitly encoded U+FFFD, it seems.

&gt; http://www.qhmc.edu.cn/index/news/4/html/qhmc83.htm
&gt;     Content-Type: text/html
&gt;     meta: &lt;META content=&quot;text/html; charset=gb2312&quot; http-equiv=Content-Type&gt;

U+FEFF and © (in contexts that don&apos;t really make sense)

&gt; http://www.f5.com.cn/press/20081027.html
&gt;     Content-Type: text/html
&gt;     meta: &lt;meta http-equiv=&quot;content-type&quot; content=&quot;text/html;charset=gb2312&quot; /&gt;

® and ™

&gt; http://en.nefu.edu.cn/oc.php
&gt;     Content-Type: text/html
&gt;     meta: &lt;meta http-equiv=&quot;Content-Type&quot; content=&quot;text/html; charset=gb18030&quot;&gt;

U+00A0 encoded as a triple for no good reason, but correctly labeled so it doesn&apos;t matter.</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>67186</commentid>
    <comment_count>4</comment_count>
    <who name="Philip Jägenstedt">philipj</who>
    <bug_when>2012-04-30 12:38:34 +0000</bug_when>
    <thetext>Conclusion:

It doesn&apos;t really matter for decoding, as actual GB18030 content seems to be extremely rare. Of the pages not labeled as GB18030, only f5.com.cn shows clearly intentional usage.
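
For reference, here is a minimal sketch of the kind of scan that can flag such content. It is not the actual test-gb.py. It assumes that the triples mentioned in this bug are GB18030 four-byte sequences (a byte in 0x81-0xFE followed by a byte in 0x30-0x39, twice over), and the constant and function names are made up for illustration.

```python
# Hypothetical sketch, not the real test-gb.py: find well-formed GB18030
# four-byte sequences while stepping over ASCII and GBK two-byte sequences.
LEAD = range(0x81, 0xFF)    # bytes 1 and 3 of a four-byte sequence: 0x81-0xFE
DIGIT = range(0x30, 0x3A)   # bytes 2 and 4 of a four-byte sequence: 0x30-0x39
# GBK trail bytes: 0x40-0xFE except 0x7F
TRAIL = frozenset(b for b in range(0x40, 0xFF) if b != 0x7F)


def find_four_byte_sequences(data):
    # Return the byte offsets at which a four-byte sequence starts.
    offsets = []
    i, n = 0, len(data)
    while i != n:
        chunk = data[i:i + 4]
        if (len(chunk) == 4 and chunk[0] in LEAD and chunk[1] in DIGIT
                and chunk[2] in LEAD and chunk[3] in DIGIT):
            offsets.append(i)
            i += 4
        elif len(chunk) != 1 and chunk[0] in LEAD and chunk[1] in TRAIL:
            i += 2  # ordinary GBK two-byte sequence
        else:
            i += 1  # ASCII or a stray byte; resynchronise on the next byte
    return offsets


# U+00A0 as a four-byte sequence (0x81 0x30 0x84 0x32) between two ASCII bytes
sample = bytes([0x61, 0x81, 0x30, 0x84, 0x32, 0x62])
print(find_four_byte_sequences(sample))  # prints [1]
```

A byte in 0x30-0x39 is never a valid GBK trail byte, so any hit in a page labeled gbk or gb2312 is a candidate for GB18030 content. The sketch only checks that a sequence is well-formed, not that it maps to an assigned code point.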

My assumption was that GB18030 content labeled as GBK would be common, but since it is not, merging them will likely mask a lot more decoding errors than it fixes.</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>75988</commentid>
    <comment_count>5</comment_count>
    <who name="Anne">annevk</who>
    <bug_when>2012-10-11 15:10:54 +0000</bug_when>
    <thetext>Thanks Philip!</thetext>
  </long_desc>
      
      

    </bug>

</bugzilla>