This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019. Please see the home page for more details.

Bug 16773 - Expand the label list
Summary: Expand the label list
Status: RESOLVED FIXED
Alias: None
Product: WHATWG
Classification: Unclassified
Component: Encoding
Version: unspecified
Hardware: PC All
Importance: P2 normal
Target Milestone: Unsorted
Assignee: Anne
QA Contact: sideshowbarker+encodingspec
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2012-04-18 11:30 UTC by Anne
Modified: 2012-11-16 13:50 UTC
CC List: 4 users

See Also:


Attachments

Description Anne 2012-04-18 11:30:29 UTC
The current draft is rather conservative with labels (roughly the intersection of browsers for single-byte encodings). Opera supports a lot more (http://wiki.whatwg.org/wiki/Encoding#Labels). Do more research here and expand the label list where it makes sense and there is no evidence of harm (e.g. euc_jp meaning euc-jp is trouble).
Comment 1 Anne 2012-04-18 11:57:15 UTC
Some data to start with: http://simon.html5.org/dump/encoding-labels/
Comment 2 Simon Pieters 2012-04-18 15:03:58 UTC
I manually checked the URLs in http://simon.html5.org/dump/encoding-labels/labels-urls.txt up to (but not including) http://id22.fm-p.jp/41/kenzokeiba/ using http://www.rexswain.com/httpview.html and listed the URLs that have one of the interesting labels in their first encoding declaration, have non-ASCII bytes, and haven't disappeared:

www.b2blogger.com/pressroom/tag/%F7%E0%F1%F2%ED%FB%E9%20%EA%EB%E8%E5%ED%F2
www.maisondelapoesie.be/auteurs/auteur.php?id_auteur=1337
eplus.jp/sys/main.jsp?prm=U=41:P34=main:P0=GGWH01:P11=10
cafe.naver.com/yps5
cafe.bssd.or.kr/cafe/index.html?cafe_id=spritus&menu=458&page=6
www.gruenkauf.biz/mtranet/impressum?tempid=UCLH4NLUM1AGZYT2
oyazich.s151.xrea.com/test/read.cgi/etc/1202122384/512
danceuniverse.co.kr/shop/cd.php?ty=n&id=081018130724
www.dllka.com/dll/L/listbox.dll.html
den2.777.cx/blog_top/archive_164.htm
gw.tv/fw/b/pc/Brand.html?mthd=09&SC=0F1&BC=CLE01&A=05&D=00&aid=sys_brand&aid2=&aid3=
search.auction.co.kr/search/listid.aspx?seller=odns&frm=list&category=21000000&tab=1
search.auction.co.kr/search/listid.aspx?seller=odritotu&frm=list&category=16000000&tab=1
itempage3.auction.co.kr/DetailView.aspx?ItemNo=A101214690
corners.auction.co.kr/corner/brand.aspx?brand=211&category=32071001
www.ribbonribbon.com/shop/step1.php?number=3142
forum.nasha.lv/viewtopic.php?p=102443
www.polondom.ru/index.php?page=19&id=381
www.rediff.com/gujarati/2002/apr/19dalal.htm
fullcast.jp/job/SearchWork.do?siteType=01&startIndex=0&areaKind=&branchId=&wideJobClass=&middleJobClass=&freeword=&pageType=02&keyword=0000028&branchId=&groupcorpId=&sortType=1
czudovo.info/what.php?what=%E0%E2%EE%F1%FC&ln=hy&in=from_ru

(Then I realized that the list was still quite long and it would be better to write a script that checks those things.)

These URLs need further analysis as to whether it is better to support a particular label (e.g. because the page only has one encoding decl and uses that encoding), it makes no difference (e.g. because it has a later encoding decl with a supported label that maps to the same encoding), or it's better to *not* support it (e.g. because it has a later encoding decl with a supported label that maps to a different encoding that the page actually uses).
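
A minimal sketch of the filter described above (first encoding declaration uses an interesting label, and the page has non-ASCII bytes). This is not the script used for the data in this bug; the label set, the simplified charset regex, and the function names are assumptions for illustration only.

```python
import re

# Hypothetical sample of "interesting" labels; the real list came from the data dump.
INTERESTING_LABELS = {"ks_c_5601-1987", "sjis", "ms932", "cp1251"}

# Simplified pattern for charset=... in a Content-Type header or a <meta> element.
# A real prescan is more involved; this is only an approximation.
CHARSET_RE = re.compile(rb"""charset\s*=\s*["']?([^"'\s;>/]+)""", re.IGNORECASE)


def declared_labels(content_type, body):
    """Return declared labels in order: HTTP header first, then the body."""
    labels = []
    if content_type:
        m = CHARSET_RE.search(content_type.encode("ascii", "replace"))
        if m:
            labels.append(m.group(1).decode("ascii").lower())
    for m in CHARSET_RE.finditer(body):
        labels.append(m.group(1).decode("ascii", "replace").lower())
    return labels


def has_non_ascii(body):
    """True if any byte in the body is outside the ASCII range."""
    return any(b > 0x7F for b in body)


def is_candidate(content_type, body):
    """True if the first declaration is an interesting label and the body has non-ASCII bytes."""
    labels = declared_labels(content_type, body)
    return bool(labels) and labels[0] in INTERESTING_LABELS and has_non_ascii(body)
```

Checking whether a page "hasn't disappeared" would additionally need a live HTTP fetch, which is left out here.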
Comment 3 Simon Pieters 2012-04-19 05:20:56 UTC
(In reply to comment #2)
> (Then I realized that the list was still quite long and it would be better to
> write a script that checks those things.)

Script
http://simon.html5.org/dump/encoding-labels/get-labels-with-non-ascii.py

Result
http://simon.html5.org/dump/encoding-labels/labels-with-non-ascii.zip

> These URLs need further analysis as to whether it is better to support a
> particular label (e.g. because the page only has one encoding decl and uses
> that encoding), it makes no difference (e.g. because it has a later encoding
> decl with a supported label that maps to the same encoding), or it's better to
> *not* support it (e.g. because it has a later
> encoding decl with a supported label that maps to a different encoding that the
> page actually uses).

Also applies to this.
Comment 4 Simon Pieters 2012-04-19 07:55:20 UTC
Now tried a slightly different approach where the script looks for a later encoding declaration and categorizes the pages:

http://simon.html5.org/dump/encoding-labels/labels-with-nonascii-categorized.txt

From this we can see that the following labels make no difference as to whether they are supported since all pages (in this data set) have a later encoding declaration with a supported alias:

cswindows31j
ms936
iso8859-9
cp1254

And x-mac-turkish appears to have no pages with non-ASCII bytes, so it also makes no difference whether it's supported.

Labels that have a non-zero "pages with later supported decl for other encoding" count are possibly better left unsupported, but this needs investigation by checking the URLs manually.
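
The categorization can be sketched roughly as follows. This is an illustration, not the script that produced the results file: the two tables are partial, the candidate-to-encoding mappings are assumed intended targets, and the category names are made up.

```python
# Hypothetical, partial table of already-supported labels and their encodings.
SUPPORTED_LABELS = {
    "windows-1254": "windows-1254",
    "iso-8859-9": "windows-1254",
    "utf-8": "utf-8",
}

# Assumed intended target encoding of each candidate (not yet supported) label.
CANDIDATE_TARGETS = {
    "cp1254": "windows-1254",
    "iso8859-9": "windows-1254",
}


def categorize(labels):
    """Categorize a page given its declared labels in order of appearance.

    labels[0] is the candidate label from the first declaration; the rest
    are later declarations on the same page.
    """
    first, later = labels[0], labels[1:]
    target = CANDIDATE_TARGETS[first]
    for label in later:
        encoding = SUPPORTED_LABELS.get(label)
        if encoding is None:
            continue  # later declaration is unsupported too; keep looking
        # The first later supported declaration decides the outcome.
        return "no difference" if encoding == target else "other encoding"
    return "no later supported declaration"
```

A label whose pages all fall into "no difference" can be added or left out without changing how those pages render; pages in "other encoding" are the ones to check by hand.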
Comment 5 Simon Pieters 2012-04-19 08:27:44 UTC
x-user-defined pages seem to be broken in my browsers; possibly it should be recognized as "use the locale-dependent fallback encoding" since some pages had a later iso-8859-1 declaration but were not actually encoded in iso-8859-1.
Comment 6 Simon Pieters 2012-04-19 08:35:53 UTC
cp1251 appears to not make any difference whether it's supported for this data set.
Comment 7 Simon Pieters 2012-04-19 08:37:30 UTC
Same with cp1250
Comment 8 Simon Pieters 2012-04-19 08:53:07 UTC
Same with cp1252

It seems many pages have changed or disappeared, so this dataset isn't too useful for checking live pages. :-(
Comment 9 Simon Pieters 2012-04-19 09:16:08 UTC
http://tyosaku.hanrei.jp/detailPageLink/cr/6%8F%F01%8D%80+%98Z%8F%F0%88ꍀ/page0.html has MS932 in HTTP header and MS-932 in <meta> and appears to use shift_jis. Neither of those labels is in the spec.
Comment 11 Simon Pieters 2012-04-19 09:34:59 UTC
The following pages use ks_c_5601-1987 in HTTP header and MS949 in <meta>:

http://blog.naver.com/hyoung307/20059659021 
http://blog.naver.com/runeslove

The following page uses ks_c_5601-1987 in both HTTP header and <meta>:

http://cad.daoudata.co.kr/index.php?part=product&code=news&mode=view&idx=13&start=0&s_mode=&ct1=
Comment 12 Simon Pieters 2012-04-19 09:46:02 UTC
More ks_c_5601-1987: seemingly all of *.auction.co.kr (31 pages in this sample)
Comment 13 Simon Pieters 2012-04-19 09:55:05 UTC
OK, I've looked through the data and can't get further with this. It seems safe to conclude that ks_c_5601-1987 and sjis should be added to the spec. For the other labels, we need to do a new study with fresh data, I think.
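
For illustration, a label lookup with these two labels added might look like the sketch below. The table is a tiny, partial excerpt, and mapping ks_c_5601-1987 to euc-kr and sjis to shift_jis is my reading of where these labels belong, not something stated in this bug. The normalization (strip ASCII whitespace, ASCII case-insensitive match) follows the lookup approach of the Encoding Standard as I understand it.

```python
import string

ASCII_WHITESPACE = "\t\n\x0c\r "
# ASCII-only lowercasing, so non-ASCII look-alike characters are not folded.
ASCII_LOWER = str.maketrans(string.ascii_uppercase, string.ascii_lowercase)

# Partial, illustrative label table (assumed mappings, see note above).
LABELS = {
    "euc-kr": "euc-kr",
    "ks_c_5601-1987": "euc-kr",
    "shift_jis": "shift_jis",
    "sjis": "shift_jis",
}


def get_encoding(label):
    """Return the encoding name for a label, or None if the label is unknown."""
    key = label.strip(ASCII_WHITESPACE).translate(ASCII_LOWER)
    return LABELS.get(key)
```

With this table, euc_jp stays unrecognized, matching the concern in the description that treating it as euc-jp causes trouble.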
Comment 15 Anne 2012-10-11 15:15:48 UTC
x-user-defined has meanwhile been added as well per feedback from hsivonen.
Comment 16 Anne 2012-11-16 13:50:07 UTC
Closing this. If people feel more work needs to be done, let's open a dedicated bug for that.