This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019. Please see the home page for more details.
https://bugzilla.mozilla.org/show_bug.cgi?id=1120813 shows an email failure when using Encoding Standard label resolution for processing incoming email, because MS932 is not recognized as a label of Shift_JIS. According to the Wikipedia pages linked from the comments on that bug, Java started recognizing the label windows-31j in JDK 1.4.1. Nowadays Java uses IANA preferred names and treats MS932 as an alias for windows-31j, which is the preferred name. The theory is that email labeled MS932 arises from MS932 having been the Java-recognized way to name the Windows flavor of Shift_JIS in legacy Java. https://wiki.whatwg.org/wiki/Web_Encodings#Encodings indicates that Presto-Opera supported MS932 as a label of Shift_JIS. (It apparently also supported the cp932 label, which the JDK doesn't know about.) Although we don't have a Web-motivated indication justifying the introduction of ms932 as a label of Shift_JIS, it seems probably harmless and might fix more than the one email that has been reported.
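For illustration, the label resolution failure described above can be sketched roughly as follows. This is a minimal sketch, not the Encoding Standard's actual table: the label set is limited to the Shift_JIS labels mentioned in this bug, and the names `get_encoding` and `extra_labels` are hypothetical helpers introduced only for this example.

```python
import string

# Shift_JIS labels as discussed in this bug (abbreviated; not the full
# Encoding Standard label table, which covers many more encodings).
SHIFT_JIS_LABELS = {
    "csshiftjis", "ms_kanji", "shift-jis", "shift_jis",
    "sjis", "windows-31j", "x-sjis",
}

ASCII_WHITESPACE = "\t\n\x0c\r "
# Label matching is ASCII case-insensitive, so only A-Z are lowercased.
ASCII_LOWER = str.maketrans(string.ascii_uppercase, string.ascii_lowercase)


def get_encoding(label, extra_labels=frozenset()):
    """Return "Shift_JIS" if the label resolves to it, else None (failure).

    extra_labels is a hypothetical hook for trying out a proposed label.
    """
    normalized = label.strip(ASCII_WHITESPACE).translate(ASCII_LOWER)
    if normalized in SHIFT_JIS_LABELS or normalized in extra_labels:
        return "Shift_JIS"
    return None


# Without the proposed label, MS932 fails to resolve.
without_label = get_encoding("MS932")
# With ms932 added as a label, the same input resolves to Shift_JIS.
with_label = get_encoding("MS932", extra_labels={"ms932"})
```

Under these assumptions, `without_label` is `None` (the failure seen in the email case) and `with_label` is `"Shift_JIS"`.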
Apparently Opera was the only browser to support this as a label. Data would be good to have before making this change, as we've seen before that adding a label can be problematic (shift-jis is not a label for shift_jis):
  Content-Type: text/html; charset=shift-jis
  <meta charset=utf-8>
http://webdevdata.org/ data set 2015-01-08 (780 Mb), 87,000 pages. 1 page would be fixed by supporting the label, none would regress.
daimaru-matsuzakaya.jp
  Content-Type: text/html;charset=MS932
  <meta http-equiv="Content-Type" content="text/html; charset=Shift_JIS" />
  OK either way
eplus.jp
  Content-Type: text/html; charset=MS932
  <META http-equiv="Content-Type" content="text/html;charset=Windows-31J">
  OK either way
benesse.jp
  Content-Type: text/html;charset=MS932
  <meta http-equiv="Content-Type" content="text/html; charset=Shift_JIS">
  OK either way
saisoncard-sindan.jp
  Content-Type: text/html
  <meta http-equiv="Content-Type" content="text/html; charset=MS932">
  OK in Presto, broken elsewhere
peachjohn.co.jp
  Content-Type: text/html;charset=MS932
  <meta http-equiv="Content-Type" content="text/html; charset=Shift_JIS" />
  OK either way
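The classification of these five pages can be reproduced with a rough sketch like the one below. It assumes a simplified model in which a recognized HTTP `charset` wins, an unrecognized one falls through to the `<meta>` declaration, and anything else is left undetermined (`None`). BOM sniffing and locale or TLD defaults are deliberately ignored. The function names and the abbreviated label table are hypothetical.

```python
# Abbreviated label table (assumption: only the labels seen in this data).
KNOWN = {"shift_jis": "Shift_JIS", "windows-31j": "Shift_JIS", "utf-8": "UTF-8"}
PROPOSED = {**KNOWN, "ms932": "Shift_JIS"}

# (HTTP charset label or None, <meta> charset label) from the comment above.
PAGES = {
    "daimaru-matsuzakaya.jp": ("MS932", "Shift_JIS"),
    "eplus.jp": ("MS932", "Windows-31J"),
    "benesse.jp": ("MS932", "Shift_JIS"),
    "saisoncard-sindan.jp": (None, "MS932"),
    "peachjohn.co.jp": ("MS932", "Shift_JIS"),
}


def resolve(label, table):
    """Look up a label in the given table; None means unrecognized."""
    if label is None:
        return None
    return table.get(label.strip().lower())


def page_encoding(header_label, meta_label, table):
    """HTTP header first, then <meta>; None if neither is recognized."""
    return resolve(header_label, table) or resolve(meta_label, table)


def classify(header_label, meta_label):
    before = page_encoding(header_label, meta_label, KNOWN)
    after = page_encoding(header_label, meta_label, PROPOSED)
    if before == after:
        return "unchanged"
    if before is None:
        return "fixed"      # previously undetermined, now Shift_JIS
    return "changed"        # a different encoding wins: possible regression


results = {site: classify(*labels) for site, labels in PAGES.items()}
```

In this model only saisoncard-sindan.jp comes out as "fixed" and the other four are "unchanged". A page that paired `charset=MS932` in the header with a `<meta>` naming a different encoding would come out as "changed", which is the regression case to look for in the data.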
httparchive https://www.igvita.com/2013/06/20/http-archive-bigquery-web-performance-answers/
  SELECT page, COUNT(*) as num
  FROM [httparchive:runs.2014_08_15_requests_body]
  WHERE LOWER(mimeType) CONTAINS "ms932"
     OR REGEXP_MATCH(LOWER(body), r"\bms932\b")
  GROUP BY page
  ORDER BY num desc;
2 matches.
http://www.51sole.com/
  Content-Type: text/html; charset=utf-8
  <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
  False positive.
http://www.bestusedtires.com/
  Content-Type: text/html; charset=UTF-8
  <meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>
  False positive.
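As a side note on why a query like this can produce false positives: the word-boundary match fires on any occurrence of "ms932" in the response body, not only on charset declarations. A rough Python sketch of the difference follows. The sample body is hypothetical and is not taken from either of the two matched sites, and `CHARSET_DECL` is an assumed, stricter pattern that was not part of the original query.

```python
import re

# Same idea as the query: "ms932" as a whole word anywhere in the body.
WORD = re.compile(r"\bms932\b")
# Assumed stricter pattern: "ms932" only as the value of a charset parameter.
CHARSET_DECL = re.compile(r"charset\s*=\s*[\"']?\s*ms932\b")


def matches_anywhere(mime_type, body):
    """Mirror of the query's WHERE clause (substring in type, word in body)."""
    return "ms932" in mime_type.lower() or bool(WORD.search(body.lower()))


def declares_ms932(mime_type, body):
    """True only if the type or the body declares charset=ms932."""
    return bool(
        CHARSET_DECL.search(mime_type.lower())
        or CHARSET_DECL.search(body.lower())
    )


# Hypothetical page that merely mentions MS932 in its text.
mention_type = "text/html; charset=utf-8"
mention_body = '<meta charset="utf-8"><p>Java calls this encoding MS932.</p>'

# Hypothetical page that declares MS932 in a <meta> element.
declared_type = "text/html"
declared_body = (
    '<meta http-equiv="Content-Type" content="text/html; charset=MS932">'
)
```

Here `matches_anywhere(mention_type, mention_body)` is true while `declares_ms932(mention_type, mention_body)` is false, whereas both are true for the declaring page. The stricter pattern would still need manual review, since it can also match inside scripts or comments.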
(I misremembered the issue in comment 1, as shift-jis is clearly a known label. It was euc_jp getting recognized as euc-jp, I think.) Those pages that have Content-Type: text/html;charset=MS932 would actually be slightly better off, as we would know the encoding for certain and would no longer have to scan for it in the HTML. Thanks, I guess we should add it. Does anyone see any good reason not to do it?
Shawn Steele (Microsoft) is opposed to adding this label: https://lists.w3.org/Archives/Public/www-international/2015JanMar/0012.html
(In reply to Anne from comment #4)
> Those pages that have Content-Type: text/html;charset=MS932 would actually
> be slightly better off as we would know the encoding for certain and would no
> longer have to scan for it in the HTML.
I think these are only "better" in theory; in practice they're equivalent. Users won't notice any difference whatsoever, and the <meta> will probably most often arrive in the same packet as the header, so there's no measurable performance impact either.
> Thanks, I guess we should add it. Anyone see any good reason not to do it?
One broken page doesn't seem a particularly convincing reason to move away from the current interop (ignoring Presto) of not supporting the label. It also still seems plausible that there are other pages on the long tail with the opposite expectation.
Search for "html charset ms932" on github (171 matches, not analyzed).
https://github.com/search?utf8=✓&q="html+charset+ms932"&type=Code&ref=searchresults
Variants of "html charset X" and number of matches:
  csshiftjis           2
  ms_kanji             0
  shift-jis/shift_jis  114,006
  sjis                 224
  windows-31j          2,453
  x-sjis               5,486
  cp932                20
  mscp932              0
(In reply to Simon Pieters from comment #2)
> saisoncard-sindan.jp
> Content-Type: text/html
> <meta http-equiv="Content-Type" content="text/html; charset=MS932">
> OK in Presto, broken elsewhere
Actually it is not broken in Firefox. I assume it's because Firefox makes the default encoding be Shift_JIS based on the .jp TLD.
Thank you for reporting this. I decided to add it based on the evidence that it would make decoding legacy resources more deterministic. https://github.com/whatwg/encoding/commit/01db1f8d98a839636af8f883fa78a461c2cfc13c