16773 – Expand the label list

This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019. Please see the home page for more details.

Status:	RESOLVED FIXED

Alias:	None

Product:	WHATWG
Classification:	Unclassified
Component:	Encoding
Version:	unspecified
Hardware:	PC All

Importance:	P2 normal
Target Milestone:	Unsorted
Assignee:	Anne
QA Contact:	sideshowbarker+encodingspec

URL:
Whiteboard:
Keywords:

Depends on:
Blocks:

Reported:	2012-04-18 11:30 UTC by Anne
Modified:	2012-11-16 13:50 UTC
CC List:	4 users

See Also:

Attachments

Description Anne 2012-04-18 11:30:29 UTC

The current draft is rather conservative with labels (roughly the intersection of what browsers support for single-byte encodings). Opera supports a lot more; see http://wiki.whatwg.org/wiki/Encoding#Labels for the list. Do more research here and expand the labels when it makes sense and there is no evidence of harm (e.g. euc_jp meaning euc-jp is trouble).
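As a minimal sketch of what "supporting a label" amounts to: lookup is assumed here to be an exact table match after trimming ASCII whitespace and ASCII-lowercasing. The table below is an illustrative subset, and `get_encoding` is a hypothetical helper name, not something defined in this bug. Under that assumption a variant such as euc_jp is only recognized if it is added to the table explicitly.

```python
import string

# Illustrative subset only; the real label table lives in the Encoding spec.
LABELS = {
    "euc-jp": "euc-jp",
    "shift_jis": "shift_jis",
    "windows-1252": "windows-1252",
}

ASCII_WHITESPACE = "\t\n\x0c\r "
# Lowercase A-Z only, so non-ASCII characters are never folded.
ASCII_LOWER = str.maketrans(string.ascii_uppercase, string.ascii_lowercase)


def get_encoding(label):
    """Return the encoding name for a label, or None if the label is unknown."""
    key = label.strip(ASCII_WHITESPACE).translate(ASCII_LOWER)
    return LABELS.get(key)
```

With this table, `get_encoding(" EUC-JP ")` finds euc-jp, while `get_encoding("euc_jp")` returns None because no fuzzy matching of `_` and `-` takes place.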

Comment 1 Anne 2012-04-18 11:57:15 UTC

Some data to start with: http://simon.html5.org/dump/encoding-labels/

Comment 2 Simon Pieters 2012-04-18 15:03:58 UTC

I manually checked the URLs in http://simon.html5.org/dump/encoding-labels/labels-urls.txt up to (but not including) http://id22.fm-p.jp/41/kenzokeiba/ using http://www.rexswain.com/httpview.html and listed the URLs that have one of the interesting labels in their first encoding declaration, have non-ASCII bytes, and haven't disappeared:

www.b2blogger.com/pressroom/tag/%F7%E0%F1%F2%ED%FB%E9%20%EA%EB%E8%E5%ED%F2
www.maisondelapoesie.be/auteurs/auteur.php?id_auteur=1337
eplus.jp/sys/main.jsp?prm=U=41:P34=main:P0=GGWH01:P11=10
cafe.naver.com/yps5
cafe.bssd.or.kr/cafe/index.html?cafe_id=spritus&menu=458&page=6
www.gruenkauf.biz/mtranet/impressum?tempid=UCLH4NLUM1AGZYT2
oyazich.s151.xrea.com/test/read.cgi/etc/1202122384/512
danceuniverse.co.kr/shop/cd.php?ty=n&id=081018130724
www.dllka.com/dll/L/listbox.dll.html
den2.777.cx/blog_top/archive_164.htm
gw.tv/fw/b/pc/Brand.html?mthd=09&SC=0F1&BC=CLE01&A=05&D=00&aid=sys_brand&aid2=&aid3=
search.auction.co.kr/search/listid.aspx?seller=odns&frm=list&category=21000000&tab=1
search.auction.co.kr/search/listid.aspx?seller=odritotu&frm=list&category=16000000&tab=1
itempage3.auction.co.kr/DetailView.aspx?ItemNo=A101214690
corners.auction.co.kr/corner/brand.aspx?brand=211&category=32071001
www.ribbonribbon.com/shop/step1.php?number=3142
forum.nasha.lv/viewtopic.php?p=102443
www.polondom.ru/index.php?page=19&id=381
www.rediff.com/gujarati/2002/apr/19dalal.htm
fullcast.jp/job/SearchWork.do?siteType=01&startIndex=0&areaKind=&branchId=&wideJobClass=&middleJobClass=&freeword=&pageType=02&keyword=0000028&branchId=&groupcorpId=&sortType=1
czudovo.info/what.php?what=%E0%E2%EE%F1%FC&ln=hy&in=from_ru

(Then I realized that the list was still quite long and it would be better to write a script that checks those things.)
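For illustration, the manual check described above could be automated roughly as follows. This is a sketch under stated assumptions, not the script linked later in this bug: the `INTERESTING` set is a hypothetical sample of labels, the HTTP Content-Type header is assumed to count as the first declaration when it carries a charset, and a crude regular expression stands in for a real HTML encoding prescan.

```python
import re

# Hypothetical sample of "interesting" labels (accepted by some browser,
# absent from the draft); not the list actually used in this bug.
INTERESTING = {"sjis", "ms932", "ks_c_5601-1987", "cp1252"}

# Crude approximation of an encoding declaration; not the HTML prescan.
CHARSET_RE = re.compile(rb"""charset\s*=\s*["']?([^\s"';>]+)""", re.IGNORECASE)


def first_declared_label(http_content_type, body):
    """Return the first declared label: HTTP header first, then the body."""
    header = (http_content_type or "").encode("latin-1", "replace")
    for source in (header, body):
        match = CHARSET_RE.search(source)
        if match:
            return match.group(1).decode("ascii", "replace").lower()
    return None


def has_non_ascii(body):
    """True if any byte in the response body is outside the ASCII range."""
    return any(byte > 0x7F for byte in body)


def is_candidate(http_content_type, body):
    """True if the page's first label is interesting and the body is non-ASCII."""
    label = first_declared_label(http_content_type, body)
    return label in INTERESTING and has_non_ascii(body)
```

Whether a page "hasn't disappeared" would still need an HTTP status check, which this sketch leaves out.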

These URLs need further analysis as to whether it is better to support a particular label (e.g. because the page only has one encoding decl and uses that encoding), it makes no difference (e.g. because it has a later encoding decl with a supported label that maps to the same encoding), or it's better to *not* support it (e.g. because it has a later encoding decl with a supported label that maps to a different encoding that the page actually uses).

Comment 3 Simon Pieters 2012-04-19 05:20:56 UTC

(In reply to comment #2)
> (Then I realized that the list was still quite long and it would be better to
> write a script that checks those things.)

Script
http://simon.html5.org/dump/encoding-labels/get-labels-with-non-ascii.py

Result
http://simon.html5.org/dump/encoding-labels/labels-with-non-ascii.zip

> These URLs need further analysis as to whether it is better to support a
> particular label (e.g. because the page only has one encoding decl and uses
> that encoding), it makes no difference (e.g. because it has a later encoding
> decl with a supported label that maps to the same encoding), or it's better to
> *not* support it (e.g. because it has a later
> encoding decl with a supported label that maps to a different encoding that the
> page actually uses).

Also applies to this.

Comment 4 Simon Pieters 2012-04-19 07:55:20 UTC

Now tried a slightly different approach where the script looks for a later encoding declaration and categorizes the pages:

http://simon.html5.org/dump/encoding-labels/labels-with-nonascii-categorized.txt

From this we can see that the following labels make no difference as to whether they are supported since all pages (in this data set) have a later encoding declaration with a supported alias:

cswindows31j
ms936
iso8859-9
cp1254

And x-mac-turkish appears to have no pages with non-ASCII bytes, so it also makes no difference whether it's supported.

Labels that have a non-zero "pages with later supported decl for other encoding" count are possibly better to not support, but this needs investigation by checking the URLs manually.
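The categorization by later declaration can be sketched as below. This is an assumption-laden illustration rather than the script that produced the linked results: `SUPPORTED` and `CANDIDATES` are small hypothetical tables, and the category strings other than "later supported decl for other encoding" are invented names.

```python
from collections import Counter

# Illustrative subset of labels the draft already supports (assumed mapping).
SUPPORTED = {
    "shift_jis": "shift_jis",
    "euc-kr": "euc-kr",
    "windows-1252": "windows-1252",
}

# Assumed target encoding for each label under consideration.
CANDIDATES = {"sjis": "shift_jis", "cp1252": "windows-1252"}

NO_LATER = "no later supported decl"
SAME = "later supported decl for same encoding"
OTHER = "later supported decl for other encoding"


def categorize(candidate_label, later_label):
    """Classify one page by its first (candidate) label and any later label."""
    target = CANDIDATES[candidate_label]
    later = SUPPORTED.get(later_label) if later_label else None
    if later is None:
        return NO_LATER  # supporting the candidate label changes decoding
    if later == target:
        return SAME      # makes no difference whether it is supported
    return OTHER         # supporting it might pick the wrong encoding


def tally(pages):
    """Count categories per candidate label from (candidate, later) pairs."""
    counts = {}
    for candidate_label, later_label in pages:
        category = categorize(candidate_label, later_label)
        counts.setdefault(candidate_label, Counter())[category] += 1
    return counts
```

A label whose tally contains only the "same encoding" category makes no difference in the data set; any count in the "other encoding" category flags pages to inspect by hand.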

Comment 5 Simon Pieters 2012-04-19 08:27:44 UTC

x-user-defined pages seem to be broken in my browsers; possibly it should be recognized as "use the locale-dependent fallback encoding", since some pages had a later iso-8859-1 declaration but were not actually encoded in iso-8859-1.

Comment 6 Simon Pieters 2012-04-19 08:35:53 UTC

cp1251 appears to not make any difference whether it's supported for this data set.

Comment 7 Simon Pieters 2012-04-19 08:37:30 UTC

Same with cp1250

Comment 8 Simon Pieters 2012-04-19 08:53:07 UTC

Same with cp1252

It seems many pages have changed or disappeared, so this dataset isn't too useful for checking live pages. :-(

Comment 9 Simon Pieters 2012-04-19 09:16:08 UTC

http://tyosaku.hanrei.jp/detailPageLink/cr/6%8F%F01%8D%80+%98Z%8F%F0%88ꍀ/page0.html has MS932 in HTTP header and MS-932 in <meta> and appears to use shift_jis. Neither of those labels is in the spec.

Comment 10 Simon Pieters 2012-04-19 09:26:47 UTC

There are some live pages that use the 'sjis' label and use shift_jis encoding:

http://oyazich.s151.xrea.com/test/read.cgi/etc/1202122384/512
http://eiyou-s.com/new_item/s-2-5-12-62.html
http://otakara.chips.jp/carryover/aisyou21.php
http://bbs.kokugakuin.info/test/r.cgi/syumi/1096090311/l10

Comment 11 Simon Pieters 2012-04-19 09:34:59 UTC

The following pages use ks_c_5601-1987 in HTTP header and MS949 in <meta>:

http://blog.naver.com/hyoung307/20059659021 
http://blog.naver.com/runeslove

The following page uses ks_c_5601-1987 in both HTTP header and <meta>:

http://cad.daoudata.co.kr/index.php?part=product&code=news&mode=view&idx=13&start=0&s_mode=&ct1=

Comment 12 Simon Pieters 2012-04-19 09:46:02 UTC

More ks_c_5601-1987: seemingly all of *.auction.co.kr (31 pages in this sample)

Comment 13 Simon Pieters 2012-04-19 09:55:05 UTC

OK, I've looked through the data and can't get further with this. It seems safe to conclude that ks_c_5601-1987 and sjis should be added to the spec. For the other labels, we need to do a new study with fresh data, I think.

Comment 14 Anne 2012-04-19 13:40:45 UTC

http://dvcs.w3.org/hg/encoding/rev/d3ea478b3c73

Comment 15 Anne 2012-10-11 15:15:48 UTC

x-user-defined has meanwhile been added as well per feedback from hsivonen.

Comment 16 Anne 2012-11-16 13:50:07 UTC

Closing this. If people feel more work needs to be done, let's open a dedicated bug for that.