<?xml version="1.0" encoding="UTF-8" standalone="yes" ?>
<!DOCTYPE bugzilla SYSTEM "https://www.w3.org/Bugs/Public/page.cgi?id=bugzilla.dtd">

<bugzilla version="5.0.4"
          urlbase="https://www.w3.org/Bugs/Public/"
          
          maintainer="sysbot+bugzilla@w3.org"
>

    <bug>
          <bug_id>16773</bug_id>
          
          <creation_ts>2012-04-18 11:30:29 +0000</creation_ts>
          <short_desc>Expand the label list</short_desc>
          <delta_ts>2012-11-16 13:50:07 +0000</delta_ts>
          <reporter_accessible>1</reporter_accessible>
          <cclist_accessible>1</cclist_accessible>
          <classification_id>1</classification_id>
          <classification>Unclassified</classification>
          <product>WHATWG</product>
          <component>Encoding</component>
          <version>unspecified</version>
          <rep_platform>PC</rep_platform>
          <op_sys>All</op_sys>
          <bug_status>RESOLVED</bug_status>
          <resolution>FIXED</resolution>
          
          
          <bug_file_loc></bug_file_loc>
          <status_whiteboard></status_whiteboard>
          <keywords></keywords>
          <priority>P2</priority>
          <bug_severity>normal</bug_severity>
          <target_milestone>Unsorted</target_milestone>
          
          
          <everconfirmed>1</everconfirmed>
          <reporter name="Anne">annevk</reporter>
          <assigned_to name="Anne">annevk</assigned_to>
          <cc>jirka</cc>
    
    <cc>mike</cc>
    
    <cc>philipj</cc>
    
    <cc>zcorpan</cc>
          
          <qa_contact>sideshowbarker+encodingspec</qa_contact>

      

      

      

          <comment_sort_order>oldest_to_newest</comment_sort_order>  
          <long_desc isprivate="0" >
    <commentid>66791</commentid>
    <comment_count>0</comment_count>
    <who name="Anne">annevk</who>
    <bug_when>2012-04-18 11:30:29 +0000</bug_when>
    <thetext>The current draft is rather conservative with labels (roughly the intersection of browsers for single-byte encodings). Opera supports a lot more:
http://wiki.whatwg.org/wiki/Encoding#Labels

Do more research here and expand the label list where it makes sense and there is no evidence of harm (e.g. euc_jp meaning euc-jp is trouble).</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>66793</commentid>
    <comment_count>1</comment_count>
    <who name="Anne">annevk</who>
    <bug_when>2012-04-18 11:57:15 +0000</bug_when>
    <thetext>Some data to start with: http://simon.html5.org/dump/encoding-labels/</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>66798</commentid>
    <comment_count>2</comment_count>
    <who name="Simon Pieters">zcorpan</who>
    <bug_when>2012-04-18 15:03:58 +0000</bug_when>
    <thetext>I manually checked the URLs in http://simon.html5.org/dump/encoding-labels/labels-urls.txt up to (but not including) http://id22.fm-p.jp/41/kenzokeiba/ using http://www.rexswain.com/httpview.html and listed the URLs that have one of the interesting labels in their first encoding declaration, have non-ASCII bytes, and haven&apos;t disappeared:

www.b2blogger.com/pressroom/tag/%F7%E0%F1%F2%ED%FB%E9%20%EA%EB%E8%E5%ED%F2
www.maisondelapoesie.be/auteurs/auteur.php?id_auteur=1337
eplus.jp/sys/main.jsp?prm=U=41:P34=main:P0=GGWH01:P11=10
cafe.naver.com/yps5
cafe.bssd.or.kr/cafe/index.html?cafe_id=spritus&amp;menu=458&amp;page=6
www.gruenkauf.biz/mtranet/impressum?tempid=UCLH4NLUM1AGZYT2
oyazich.s151.xrea.com/test/read.cgi/etc/1202122384/512
danceuniverse.co.kr/shop/cd.php?ty=n&amp;id=081018130724
www.dllka.com/dll/L/listbox.dll.html
den2.777.cx/blog_top/archive_164.htm
gw.tv/fw/b/pc/Brand.html?mthd=09&amp;SC=0F1&amp;BC=CLE01&amp;A=05&amp;D=00&amp;aid=sys_brand&amp;aid2=&amp;aid3=
search.auction.co.kr/search/listid.aspx?seller=odns&amp;frm=list&amp;category=21000000&amp;tab=1
search.auction.co.kr/search/listid.aspx?seller=odritotu&amp;frm=list&amp;category=16000000&amp;tab=1
itempage3.auction.co.kr/DetailView.aspx?ItemNo=A101214690
corners.auction.co.kr/corner/brand.aspx?brand=211&amp;category=32071001
www.ribbonribbon.com/shop/step1.php?number=3142
forum.nasha.lv/viewtopic.php?p=102443
www.polondom.ru/index.php?page=19&amp;id=381
www.rediff.com/gujarati/2002/apr/19dalal.htm
fullcast.jp/job/SearchWork.do?siteType=01&amp;startIndex=0&amp;areaKind=&amp;branchId=&amp;wideJobClass=&amp;middleJobClass=&amp;freeword=&amp;pageType=02&amp;keyword=0000028&amp;branchId=&amp;groupcorpId=&amp;sortType=1
czudovo.info/what.php?what=%E0%E2%EE%F1%FC&amp;ln=hy&amp;in=from_ru

(Then I realized that the list was still quite long and it would be better to write a script that checks those things.)

These URLs need further analysis as to whether it is better to support a particular label (e.g. because the page only has one encoding decl and uses that encoding), it makes no difference (e.g. because it has a later encoding decl with a supported label that maps to the same encoding), or it&apos;s better to *not* support it (e.g. because it has a later encoding decl with a supported label that maps to a different encoding that the page actually uses).</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>66838</commentid>
    <comment_count>3</comment_count>
    <who name="Simon Pieters">zcorpan</who>
    <bug_when>2012-04-19 05:20:56 +0000</bug_when>
    <thetext>(In reply to comment #2)
&gt; (Then I realized that the list was still quite long and it would be better to
&gt; write a script that checks those things.)

Script
http://simon.html5.org/dump/encoding-labels/get-labels-with-non-ascii.py

Result
http://simon.html5.org/dump/encoding-labels/labels-with-non-ascii.zip

&gt; These URLs need further analysis as to whether it is better to support a
&gt; particular label (e.g. because the page only has one encoding decl and uses
&gt; that encoding), it makes no difference (e.g. because it has a later encoding
&gt; decl with a supported label that maps to the same encoding), or it&apos;s better to
&gt; *not* support it (e.g. because it has a later
&gt; encoding decl with a supported label that maps to a different encoding that the
&gt; page actually uses).

Also applies to this.</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>66840</commentid>
    <comment_count>4</comment_count>
    <who name="Simon Pieters">zcorpan</who>
    <bug_when>2012-04-19 07:55:20 +0000</bug_when>
    <thetext>Now tried a slightly different approach where the script looks for a later encoding declaration and categorizes the pages:

http://simon.html5.org/dump/encoding-labels/labels-with-nonascii-categorized.txt
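
For illustration, the categorization can be sketched roughly as follows. This is not the script that produced the file above. The two label tables, the regular expression, and the function name categorize are assumptions made for the example, and a real checker would handle the HTTP header and meta declarations separately rather than scanning one byte string.

```python
import re

# Hypothetical, deliberately tiny tables (label -> encoding), for illustration only.
SUPPORTED = {
    "shift_jis": "shift_jis",
    "iso-8859-1": "windows-1252",
    "windows-1252": "windows-1252",
}
CANDIDATES = {
    "sjis": "shift_jis",
    "cp1252": "windows-1252",
}

# Loose match for any charset=label declaration (assumption: good enough for a sketch).
CHARSET_RE = re.compile(rb"charset\s*=\s*[\"']?([\w.:-]+)", re.I)


def categorize(body):
    """Classify one page (raw bytes) by its first and later encoding declarations."""
    labels = [m.decode("ascii").lower() for m in CHARSET_RE.findall(body)]
    if not labels or labels[0] not in CANDIDATES:
        return "not a candidate"
    if body.isascii():
        # Without non-ASCII bytes the label cannot change how the page renders.
        return "ascii only"
    target = CANDIDATES[labels[0]]
    later = [SUPPORTED[label] for label in labels[1:] if label in SUPPORTED]
    if not later:
        return "no later supported decl"
    if later[0] == target:
        return "later supported decl for same encoding"
    return "later supported decl for other encoding"


print(categorize(b"charset=sjis ... charset=shift_jis ... \x93\xfa"))
```

A label whose pages all land in the "same encoding" or "ascii only" buckets makes no difference either way, while pages in the "other encoding" bucket are the ones to check by hand.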

From this we can see that it makes no difference whether the following labels are supported, since all pages (in this data set) that use them have a later encoding declaration with a supported alias:

cswindows31j
ms936
iso8859-9
cp1254

And x-mac-turkish appears to have no pages with non-ASCII bytes, so it also makes no difference whether it&apos;s supported.

Labels that have a non-zero &quot;pages with later supported decl for other encoding&quot; count are possibly better left unsupported, but this needs investigation by checking the URLs manually.</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>66841</commentid>
    <comment_count>5</comment_count>
    <who name="Simon Pieters">zcorpan</who>
    <bug_when>2012-04-19 08:27:44 +0000</bug_when>
    <thetext>x-user-defined pages seem to be broken in my browsers; possibly it should be recognized as &quot;use the locale-dependent fallback encoding&quot;, since some pages had a later iso-8859-1 declaration but were not actually encoded in iso-8859-1.</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>66842</commentid>
    <comment_count>6</comment_count>
    <who name="Simon Pieters">zcorpan</who>
    <bug_when>2012-04-19 08:35:53 +0000</bug_when>
    <thetext>cp1251 appears to not make any difference whether it&apos;s supported for this data set.</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>66844</commentid>
    <comment_count>7</comment_count>
    <who name="Simon Pieters">zcorpan</who>
    <bug_when>2012-04-19 08:37:30 +0000</bug_when>
    <thetext>Same with cp1250</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>66845</commentid>
    <comment_count>8</comment_count>
    <who name="Simon Pieters">zcorpan</who>
    <bug_when>2012-04-19 08:53:07 +0000</bug_when>
    <thetext>Same with cp1252

It seems many pages have changed or disappeared, so this dataset isn&apos;t too useful for checking live pages. :-(</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>66847</commentid>
    <comment_count>9</comment_count>
    <who name="Simon Pieters">zcorpan</who>
    <bug_when>2012-04-19 09:16:08 +0000</bug_when>
    <thetext>http://tyosaku.hanrei.jp/detailPageLink/cr/6%8F%F01%8D%80+%98Z%8F%F0%88ꍀ/page0.html has MS932 in HTTP header and MS-932 in &lt;meta&gt; and appears to use shift_jis. Neither of those labels are in the spec.</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>66849</commentid>
    <comment_count>10</comment_count>
    <who name="Simon Pieters">zcorpan</who>
    <bug_when>2012-04-19 09:26:47 +0000</bug_when>
    <thetext>There are some live pages that use the &apos;sjis&apos; label and are encoded in shift_jis:

http://oyazich.s151.xrea.com/test/read.cgi/etc/1202122384/512
http://eiyou-s.com/new_item/s-2-5-12-62.html
http://otakara.chips.jp/carryover/aisyou21.php
http://bbs.kokugakuin.info/test/r.cgi/syumi/1096090311/l10</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>66851</commentid>
    <comment_count>11</comment_count>
    <who name="Simon Pieters">zcorpan</who>
    <bug_when>2012-04-19 09:34:59 +0000</bug_when>
    <thetext>The following pages use ks_c_5601-1987 in HTTP header and MS949 in &lt;meta&gt;:

http://blog.naver.com/hyoung307/20059659021 
http://blog.naver.com/runeslove

The following page uses ks_c_5601-1987 in both HTTP header and &lt;meta&gt;:

http://cad.daoudata.co.kr/index.php?part=product&amp;code=news&amp;mode=view&amp;idx=13&amp;start=0&amp;s_mode=&amp;ct1=</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>66852</commentid>
    <comment_count>12</comment_count>
    <who name="Simon Pieters">zcorpan</who>
    <bug_when>2012-04-19 09:46:02 +0000</bug_when>
    <thetext>More ks_c_5601-1987: seemingly all of *.auction.co.kr (31 pages in this sample)</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>66853</commentid>
    <comment_count>13</comment_count>
    <who name="Simon Pieters">zcorpan</who>
    <bug_when>2012-04-19 09:55:05 +0000</bug_when>
    <thetext>OK, I&apos;ve looked through the data and can&apos;t get further with this. It seems safe to conclude that ks_c_5601-1987 and sjis should be added to the spec. For the other labels, we need to do a new study with fresh data, I think.</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>66857</commentid>
    <comment_count>14</comment_count>
    <who name="Anne">annevk</who>
    <bug_when>2012-04-19 13:40:45 +0000</bug_when>
    <thetext>http://dvcs.w3.org/hg/encoding/rev/d3ea478b3c73</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>75991</commentid>
    <comment_count>15</comment_count>
    <who name="Anne">annevk</who>
    <bug_when>2012-10-11 15:15:48 +0000</bug_when>
    <thetext>x-user-defined has meanwhile been added as well per feedback from hsivonen.</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>78410</commentid>
    <comment_count>16</comment_count>
    <who name="Anne">annevk</who>
    <bug_when>2012-11-16 13:50:07 +0000</bug_when>
    <thetext>Closing this. If people feel more work needs to be done, let&apos;s open a dedicated bug for that.</thetext>
  </long_desc>
      
      

    </bug>

</bugzilla>