This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019. Please see the home page for more details.
The current draft is rather conservative with labels (roughly the intersection of what browsers support, for single-byte encodings). Opera supports a lot more: http://wiki.whatwg.org/wiki/Encoding#Labels
Do more research here and expand the labels when it makes sense and there is no evidence of harm (e.g. euc_jp meaning euc-jp is trouble).
Some data to start with: http://simon.html5.org/dump/encoding-labels/
I manually checked the URLs in http://simon.html5.org/dump/encoding-labels/labels-urls.txt up to (but not including) http://id22.fm-p.jp/41/kenzokeiba/ using http://www.rexswain.com/httpview.html and listed the URLs that have one of the interesting labels in their first encoding declaration, have non-ASCII bytes, and haven't disappeared:

www.b2blogger.com/pressroom/tag/%F7%E0%F1%F2%ED%FB%E9%20%EA%EB%E8%E5%ED%F2
www.maisondelapoesie.be/auteurs/auteur.php?id_auteur=1337
eplus.jp/sys/main.jsp?prm=U=41:P34=main:P0=GGWH01:P11=10
cafe.naver.com/yps5
cafe.bssd.or.kr/cafe/index.html?cafe_id=spritus&menu=458&page=6
www.gruenkauf.biz/mtranet/impressum?tempid=UCLH4NLUM1AGZYT2
oyazich.s151.xrea.com/test/read.cgi/etc/1202122384/512
danceuniverse.co.kr/shop/cd.php?ty=n&id=081018130724
www.dllka.com/dll/L/listbox.dll.html
den2.777.cx/blog_top/archive_164.htm
gw.tv/fw/b/pc/Brand.html?mthd=09&SC=0F1&BC=CLE01&A=05&D=00&aid=sys_brand&aid2=&aid3=
search.auction.co.kr/search/listid.aspx?seller=odns&frm=list&category=21000000&tab=1
search.auction.co.kr/search/listid.aspx?seller=odritotu&frm=list&category=16000000&tab=1
itempage3.auction.co.kr/DetailView.aspx?ItemNo=A101214690
corners.auction.co.kr/corner/brand.aspx?brand=211&category=32071001
www.ribbonribbon.com/shop/step1.php?number=3142
forum.nasha.lv/viewtopic.php?p=102443
www.polondom.ru/index.php?page=19&id=381
www.rediff.com/gujarati/2002/apr/19dalal.htm
fullcast.jp/job/SearchWork.do?siteType=01&startIndex=0&areaKind=&branchId=&wideJobClass=&middleJobClass=&freeword=&pageType=02&keyword=0000028&branchId=&groupcorpId=&sortType=1
czudovo.info/what.php?what=%E0%E2%EE%F1%FC&ln=hy&in=from_ru

(Then I realized that the list was still quite long and it would be better to write a script that checks those things.)

These URLs need further analysis as to whether it is better to support a particular label (e.g. because the page only has one encoding decl and uses that encoding), it makes no difference (e.g. because it has a later encoding decl with a supported label that maps to the same encoding), or it's better to *not* support it (e.g. because it has a later encoding decl with a supported label that maps to a different encoding that the page actually uses).
(In reply to comment #2)
> (Then I realized that the list was still quite long and it would be better to
> write a script that checks those things.)

Script: http://simon.html5.org/dump/encoding-labels/get-labels-with-non-ascii.py
Result: http://simon.html5.org/dump/encoding-labels/labels-with-non-ascii.zip

> These URLs need further analysis as to whether it is better to support a
> particular label (e.g. because the page only has one encoding decl and uses
> that encoding), it makes no difference (e.g. because it has a later encoding
> decl with a supported label that maps to the same encoding), or it's better to
> *not* support it (e.g. because it has a later encoding decl with a supported
> label that maps to a different encoding that the page actually uses).

This also applies to the script's result.
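For illustration only, the filter described above (interesting label in the first encoding declaration, plus non-ASCII bytes in the page) could be sketched roughly as follows. This is not the linked script. The regular expression, the function names, and the idea of passing the Content-Type header value as bytes are all assumptions made for the sketch.

```python
import re

# Assumption: a loose pattern for "charset=<label>" that works on both a
# Content-Type header value and <meta> markup. Real sniffing is stricter.
CHARSET_RE = re.compile(rb"charset\s*=\s*[\"']?\s*([A-Za-z0-9._:\-]+)", re.IGNORECASE)


def first_declared_label(header_value, body):
    """Return the first declared label (header first, then markup), lowercased."""
    for source in (header_value or b"", body):
        match = CHARSET_RE.search(source)
        if match:
            return match.group(1).decode("ascii").lower()
    return None


def has_non_ascii(body):
    """True if the page contains at least one byte outside the ASCII range."""
    return any(byte >= 0x80 for byte in body)


def is_candidate(header_value, body, interesting_labels):
    """True if the first declaration uses an interesting label and the page
    has non-ASCII bytes, so the choice of encoding can actually matter."""
    label = first_declared_label(header_value, body)
    return label in interesting_labels and has_non_ascii(body)
```

Whether a page "hasn't disappeared" is a separate fetch-time check and is not covered here.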
I have now tried a slightly different approach, where the script looks for a later encoding declaration and categorizes the pages: http://simon.html5.org/dump/encoding-labels/labels-with-nonascii-categorized.txt

From this we can see that it makes no difference whether the following labels are supported, since all pages (in this data set) have a later encoding declaration with a supported alias:

cswindows31j
ms936
iso8859-9
cp1254

And x-mac-turkish appears to have no pages with non-ASCII bytes, so it also makes no difference whether it's supported.

Labels that have a non-zero count of "pages with later supported decl for other encoding" are possibly better left unsupported, but this needs investigation by checking the URLs manually.
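A minimal sketch of that categorization, assuming all declarations can be found in the page bytes with a loose `charset=` pattern and that `supported` is a hypothetical table of already-supported labels mapped to encoding names (neither is taken from the actual script):

```python
import re

# Assumption: a loose pattern for "charset=<label>" in <meta> markup.
CHARSET_RE = re.compile(rb"charset\s*=\s*[\"']?\s*([A-Za-z0-9._:\-]+)", re.IGNORECASE)


def declared_labels(body):
    """All declared labels in document order, lowercased."""
    return [m.group(1).decode("ascii").lower() for m in CHARSET_RE.finditer(body)]


def categorize(body, candidate_encoding, supported):
    """Classify a page whose first declaration uses a not-yet-supported label.

    candidate_encoding is the encoding the new label would map to if it
    were added; supported maps already-supported labels to encoding names.
    """
    for label in declared_labels(body)[1:]:  # skip the first declaration
        encoding = supported.get(label)
        if encoding is None:
            continue  # a later unsupported label changes nothing today
        if encoding == candidate_encoding:
            return "later supported decl for same encoding"
        return "later supported decl for other encoding"
    return "no later supported decl"
```

Only the first later declaration with a supported label decides the category here, since that is the one a browser would fall back to if the first label is ignored.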
x-user-defined pages seem to be broken in my browsers; possibly it should be recognized as "use the locale-dependent fallback encoding", since some pages had a later iso-8859-1 declaration but were not actually encoded in iso-8859-1.
cp1251 appears to not make any difference whether it's supported for this data set.
Same with cp1250
Same with cp1252.

It seems many pages have changed or disappeared, so this dataset isn't too useful for checking live pages. :-(
http://tyosaku.hanrei.jp/detailPageLink/cr/6%8F%F01%8D%80+%98Z%8F%F0%88ꍀ/page0.html has MS932 in the HTTP header and MS-932 in <meta>, and appears to use shift_jis. Neither of those labels is in the spec.
There are some live pages that use the 'sjis' label and use shift_jis encoding:

http://oyazich.s151.xrea.com/test/read.cgi/etc/1202122384/512
http://eiyou-s.com/new_item/s-2-5-12-62.html
http://otakara.chips.jp/carryover/aisyou21.php
http://bbs.kokugakuin.info/test/r.cgi/syumi/1096090311/l10
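One rough way to sanity-check a claim like "this page uses shift_jis" is to see whether the page bytes decode without errors. The sketch below is only a heuristic: a clean decode is necessary, not sufficient, and Python's `shift_jis` codec may not match a browser's decoder byte for byte. The function name is made up for the example.

```python
def decodes_cleanly(data, encoding):
    """Return True if data decodes in the given encoding without errors."""
    try:
        data.decode(encoding, errors="strict")
    except UnicodeDecodeError:
        return False
    return True


# Usage: bytes produced as shift_jis decode cleanly as shift_jis,
# but are rejected when treated as UTF-8.
sample = "日本語".encode("shift_jis")
print(decodes_cleanly(sample, "shift_jis"))
print(decodes_cleanly(sample, "utf-8"))
```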
The following pages use ks_c_5601-1987 in HTTP header and MS949 in <meta>: http://blog.naver.com/hyoung307/20059659021 http://blog.naver.com/runeslove The following page uses ks_c_5601-1987 in both HTTP header and <meta>: http://cad.daoudata.co.kr/index.php?part=product&code=news&mode=view&idx=13&start=0&s_mode=&ct1=
More ks_c_5601-1987: seemingly all of *.auction.co.kr (31 pages in this sample)
OK, I've looked through the data and can't get further with this. It seems safe to conclude that ks_c_5601-1987 and sjis should be added to the spec. For the other labels, we need to do a new study with fresh data, I think.
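As a sketch of what adding the two labels amounts to, a label lookup could look like the following. The table is a tiny illustrative subset, not the spec's table. Mapping sjis to shift_jis follows the pages listed above; mapping ks_c_5601-1987 to euc-kr is an assumption here, since this thread does not state the target encoding.

```python
# ASCII whitespace to strip from a label before matching.
ASCII_WHITESPACE = "\t\n\x0c\r "

# Illustrative subset only; names and structure are hypothetical.
LABELS = {
    "shift_jis": "shift_jis",
    "sjis": "shift_jis",         # newly added label
    "euc-kr": "euc-kr",
    "ks_c_5601-1987": "euc-kr",  # newly added label (target is an assumption)
}


def get_encoding(label):
    """Map a label to an encoding name, ignoring case and surrounding
    ASCII whitespace; return None for unknown labels."""
    return LABELS.get(label.strip(ASCII_WHITESPACE).lower())
```

Labels such as MS932 and MS-932, which are left for a later study, simply return None in this sketch.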
http://dvcs.w3.org/hg/encoding/rev/d3ea478b3c73
x-user-defined has meanwhile been added as well per feedback from hsivonen.
Closing this. If people feel more work needs to be done, let's open a dedicated bug for that.