This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019. Please see the home page for more details.

Bug 27851 - Add MS932 as a label of Shift_JIS
Summary: Add MS932 as a label of Shift_JIS
Status: RESOLVED FIXED
Alias: None
Product: WHATWG
Classification: Unclassified
Component: Encoding
Version: unspecified
Hardware: All All
Importance: P2 normal
Target Milestone: Unsorted
Assignee: Anne
QA Contact: sideshowbarker+encodingspec
Reported: 2015-01-19 10:45 UTC by Henri Sivonen
Modified: 2015-08-19 11:13 UTC
CC List: 5 users

Description Henri Sivonen 2015-01-19 10:45:59 UTC
https://bugzilla.mozilla.org/show_bug.cgi?id=1120813 shows an email failure when using Encoding Standard label resolution for processing incoming email, because MS932 is not recognized as a label of Shift_JIS.

According to Wikipedia pages linked from the comments on that bug, Java started recognizing the label windows-31j in JDK 1.4.1. Nowadays Java uses IANA preferred names and treats MS932 as an alias for windows-31j, which is the preferred name. The theory is that email labeled MS932 arises from MS932 having been the Java-recognized way to name the Windows flavor of Shift_JIS in legacy Java.

https://wiki.whatwg.org/wiki/Web_Encodings#Encodings indicates that Presto-Opera supported MS932 as a label of Shift_JIS. (And apparently also supported the cp932 label, which the JDK doesn't know about.)

Although we don't have a Web-motivated indication justifying the introduction of ms932 as a label of Shift_JIS, it seems probably harmless and might fix more than the one email that has been reported.
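
To make the request concrete, here is a minimal sketch of label resolution in the style of the Encoding Standard: strip ASCII whitespace, compare ASCII case-insensitively, and return failure for anything not in the table. The function name `get_encoding`, the `extra_shift_jis_labels` parameter, and the restriction to Shift_JIS only are illustrative assumptions, and the label set is assumed to match the Shift_JIS labels in the standard at the time of this report.

```python
# Characters the Encoding Standard treats as ASCII whitespace.
ASCII_WHITESPACE = "\t\n\x0c\r "

# Translation table that lowercases A-Z only (ASCII case-insensitive match).
ASCII_LOWERCASE = {code: code + 32 for code in range(ord("A"), ord("Z") + 1)}

# Assumed Shift_JIS label set before MS932 is considered.
SHIFT_JIS_LABELS = frozenset({
    "csshiftjis", "ms_kanji", "shift-jis", "shift_jis",
    "sjis", "windows-31j", "x-sjis",
})


def get_encoding(label, extra_shift_jis_labels=frozenset()):
    """Return "Shift_JIS" if the label resolves to it, else None (failure)."""
    normalized = label.strip(ASCII_WHITESPACE).translate(ASCII_LOWERCASE)
    if normalized in SHIFT_JIS_LABELS or normalized in extra_shift_jis_labels:
        return "Shift_JIS"
    return None
```

With this table, `get_encoding("MS932")` returns `None`, which is the failure described above; `get_encoding("MS932", {"ms932"})` returns `"Shift_JIS"`.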
Comment 1 Anne 2015-01-19 13:44:39 UTC
Apparently Opera was the only browser to support this as a label. Data would be good to have before making this change, as we've seen before that adding a label can be problematic (shift-jis is not a label for shift_jis):

  Content-Type: text/html; charset=shift-jis

  <meta charset=utf-8>
Comment 2 Simon Pieters 2015-01-19 15:30:52 UTC
http://webdevdata.org/ data set 2015-01-08 (780 MB), 87,000 pages.

One page would be fixed by supporting the label; none would regress.

daimaru-matsuzakaya.jp has
Content-Type: text/html;charset=MS932
<meta http-equiv="Content-Type" content="text/html; charset=Shift_JIS" />
OK either way

eplus.jp
Content-Type: text/html; charset=MS932
<META http-equiv="Content-Type" content="text/html;charset=Windows-31J">
OK either way

benesse.jp
Content-Type: text/html;charset=MS932
<meta http-equiv="Content-Type" content="text/html; charset=Shift_JIS">
OK either way

saisoncard-sindan.jp
Content-Type: text/html
<meta http-equiv="Content-Type" content="text/html; charset=MS932">
OK in Presto, broken elsewhere

peachjohn.co.jp
Content-Type: text/html;charset=MS932
<meta http-equiv="Content-Type" content="text/html; charset=Shift_JIS" />
OK either way
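
The per-page verdicts above can be reproduced with a rough sketch like the following. It assumes a simplified precedence (a recognized Content-Type header label wins, then a recognized <meta> label, then a fallback) and ignores BOM sniffing and the details of the <meta> prescan. The names `KNOWN_LABELS`, `effective_encoding`, and `classify`, the `windows-1252` fallback, and the `actual_encoding` argument (the encoding the page bytes are really in) are all hypothetical.

```python
# Tiny stand-in label table; the real table is much larger.
KNOWN_LABELS = {
    "shift_jis": "Shift_JIS",
    "windows-31j": "Shift_JIS",
    "utf-8": "UTF-8",
}


def resolve(label, labels):
    """Map a label to an encoding name, or None if missing or unknown."""
    if label is None:
        return None
    return labels.get(label.strip().lower())


def effective_encoding(header_label, meta_label, labels, fallback="windows-1252"):
    """Header label first, then <meta> label, then the assumed fallback."""
    return resolve(header_label, labels) or resolve(meta_label, labels) or fallback


def classify(header_label, meta_label, actual_encoding):
    """Compare decoding before and after adding ms932 as a Shift_JIS label."""
    before = effective_encoding(header_label, meta_label, KNOWN_LABELS)
    after = effective_encoding(
        header_label, meta_label, {**KNOWN_LABELS, "ms932": "Shift_JIS"}
    )
    if before == after:
        return "unchanged"
    if after == actual_encoding:
        return "fixed"
    if before == actual_encoding:
        return "regressed"
    return "changed"
```

Under these assumptions, `classify("MS932", "Shift_JIS", "Shift_JIS")` returns `"unchanged"` (the daimaru-matsuzakaya.jp pattern) and `classify(None, "MS932", "Shift_JIS")` returns `"fixed"` (the saisoncard-sindan.jp pattern). A hypothetical page with an MS932 header but UTF-8 content and <meta>, `classify("MS932", "utf-8", "UTF-8")`, returns `"regressed"`; no such page appeared in this data set.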
Comment 3 Simon Pieters 2015-01-19 15:48:47 UTC
httparchive https://www.igvita.com/2013/06/20/http-archive-bigquery-web-performance-answers/

SELECT page, COUNT(*) as num
FROM [httparchive:runs.2014_08_15_requests_body]
WHERE LOWER(mimeType) CONTAINS "ms932"
OR REGEXP_MATCH(LOWER(body), r"\bms932\b")
GROUP BY page
ORDER BY num desc;

2 matches.

http://www.51sole.com/
Content-Type: text/html; charset=utf-8
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
False positive.

http://www.bestusedtires.com/
Content-Type: text/html; charset=UTF-8
<meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>
False positive.
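
The thread does not say why these two pages matched. One plausible cause is that `\bms932\b` matches the token anywhere in the response body, not only in an encoding declaration. As a sketch, a narrower pattern that requires `charset=` in front of the label would avoid that kind of hit. The pattern names and the sample strings in the usage note are hypothetical.

```python
import re

# Loose match, equivalent in spirit to the query above: the token anywhere.
LOOSE = re.compile(r"\bms932\b", re.IGNORECASE)

# Narrower match: the token must follow "charset=", optionally quoted.
DECLARED = re.compile(r"charset\s*=\s*[\"']?\s*ms932\b", re.IGNORECASE)


def mentions_label(body):
    """True if the body contains ms932 as a word anywhere."""
    return LOOSE.search(body) is not None


def declares_label(body):
    """True if the body contains something shaped like charset=ms932."""
    return DECLARED.search(body) is not None
```

For a body such as `var encodings = ["utf-8", "ms932"];`, `mentions_label` returns `True` while `declares_label` returns `False`; for `<meta http-equiv="Content-Type" content="text/html; charset=MS932">`, both return `True`.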
Comment 4 Anne 2015-01-19 16:14:35 UTC
(I misremembered the issue in comment 1, as shift-jis is clearly a known label. It was euc_jp getting recognized as euc-jp, I think.)

Those pages that have Content-Type: text/html;charset=MS932 would actually be slightly better off, as we would know the encoding for certain and would no longer have to scan for it in the HTML.

Thanks, I guess we should add it. Anyone see any good reason not to do it?
Comment 5 Anne 2015-01-20 07:25:07 UTC
Shawn Steele (Microsoft) is opposed to adding this label: https://lists.w3.org/Archives/Public/www-international/2015JanMar/0012.html
Comment 6 Simon Pieters 2015-01-20 09:57:29 UTC
(In reply to Anne from comment #4)
> Those pages that have Content-Type: text/html;charset=MS932 would actually
> be slightly better off as we would know the encoding for certain and would no
> longer have to scan for it in the HTML.

I think these are only "better" in theory; in practice they're equivalent. Users won't notice any difference whatsoever, and the <meta> will most often arrive in the same packet as the header, so there's no measurable performance impact either.

> Thanks, I guess we should add it. Anyone see any good reason not to do it?

One broken page doesn't seem like a particularly convincing reason to move away from the interop (ignoring Presto) of not supporting the label. It also still seems plausible that there are other pages on the long tail with the opposite expectation.

Search for "html charset ms932" on GitHub (171 matches, not analyzed).

https://github.com/search?utf8=✓&q="html+charset+ms932"&type=Code&ref=searchresults

Variants of "html charset X" and number of matches:

csshiftjis 2
ms_kanji 0
shift-jis/shift_jis 114,006
sjis 224
windows-31j 2,453
x-sjis 5,486
cp932 20
mscp932 0
Comment 7 Simon Pieters 2015-01-20 10:14:31 UTC
(In reply to Simon Pieters from comment #2)
> saisoncard-sindan.jp
> Content-Type: text/html
> <meta http-equiv="Content-Type" content="text/html; charset=MS932">
> OK in Presto, broken elsewhere

Actually it is not broken in Firefox. I assume it's because Firefox makes the default encoding be Shift_JIS based on the .jp TLD.
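
As an illustration of the assumption in this comment only, a TLD-based fallback could look like the sketch below. The table `HYPOTHETICAL_TLD_FALLBACKS`, the function `fallback_encoding`, and the `windows-1252` default are made up for the example and are not taken from Firefox; the `http://` scheme on the sample URL is likewise assumed.

```python
from urllib.parse import urlsplit

# Hypothetical mapping; only the .jp entry is suggested by the comment above.
HYPOTHETICAL_TLD_FALLBACKS = {"jp": "Shift_JIS"}


def fallback_encoding(url, default="windows-1252"):
    """Pick a fallback encoding from the last label of the host name."""
    host = urlsplit(url).hostname or ""
    tld = host.rsplit(".", 1)[-1].lower()
    return HYPOTHETICAL_TLD_FALLBACKS.get(tld, default)
```

With this sketch, `fallback_encoding("http://saisoncard-sindan.jp/")` returns `"Shift_JIS"`, so a page whose only declaration is an unrecognized MS932 label would still decode as Shift_JIS.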
Comment 8 Anne 2015-08-19 11:13:56 UTC
Thank you for reporting this. I decided to add it based on the evidence that it would make decoding legacy resources more deterministic.

https://github.com/whatwg/encoding/commit/01db1f8d98a839636af8f883fa78a461c2cfc13c