<?xml version="1.0" encoding="UTF-8" standalone="yes" ?>
<!DOCTYPE bugzilla SYSTEM "https://www.w3.org/Bugs/Public/page.cgi?id=bugzilla.dtd">

<bugzilla version="5.0.4"
          urlbase="https://www.w3.org/Bugs/Public/"
          
          maintainer="sysbot+bugzilla@w3.org"
>

    <bug>
          <bug_id>27851</bug_id>
          
          <creation_ts>2015-01-19 10:45:59 +0000</creation_ts>
          <short_desc>Add MS932 as a label of Shift_JIS</short_desc>
          <delta_ts>2015-08-19 11:13:56 +0000</delta_ts>
          <reporter_accessible>1</reporter_accessible>
          <cclist_accessible>1</cclist_accessible>
          <classification_id>1</classification_id>
          <classification>Unclassified</classification>
          <product>WHATWG</product>
          <component>Encoding</component>
          <version>unspecified</version>
          <rep_platform>All</rep_platform>
          <op_sys>All</op_sys>
          <bug_status>RESOLVED</bug_status>
          <resolution>FIXED</resolution>
          
          <see_also>https://bugzilla.mozilla.org/show_bug.cgi?id=1120813</see_also>
          <bug_file_loc></bug_file_loc>
          <status_whiteboard></status_whiteboard>
          <keywords></keywords>
          <priority>P2</priority>
          <bug_severity>normal</bug_severity>
          <target_milestone>Unsorted</target_milestone>
          
          
          <everconfirmed>1</everconfirmed>
          <reporter name="Henri Sivonen">hsivonen</reporter>
          <assigned_to name="Anne">annevk</assigned_to>
          <cc>jsbell</cc>
    
    <cc>jshin</cc>
    
    <cc>mike</cc>
    
    <cc>www-international</cc>
    
    <cc>zcorpan</cc>
          
          <qa_contact>sideshowbarker+encodingspec</qa_contact>

      

      

      

          <comment_sort_order>oldest_to_newest</comment_sort_order>  
          <long_desc isprivate="0" >
    <commentid>117304</commentid>
    <comment_count>0</comment_count>
    <who name="Henri Sivonen">hsivonen</who>
    <bug_when>2015-01-19 10:45:59 +0000</bug_when>
    <thetext>https://bugzilla.mozilla.org/show_bug.cgi?id=1120813 shows an email failure when using Encoding Standard label resolution for processing incoming email, because MS932 is not recognized as a label of Shift_JIS.
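
A minimal sketch of the kind of label lookup involved, in Python. The table is only an excerpt limited to Shift_JIS labels, and the names `SHIFT_JIS_LABELS`, `get_encoding`, and the `extra_labels` parameter are made up for illustration rather than taken from any implementation:

```python
import string

# Excerpt of labels that map to Shift_JIS (illustrative, not the full table).
SHIFT_JIS_LABELS = {
    "csshiftjis", "ms_kanji", "shift-jis", "shift_jis",
    "sjis", "windows-31j", "x-sjis",
}

ASCII_WHITESPACE = "\t\n\x0c\r "
# Lowercase only A-Z, so non-ASCII characters are never folded into ASCII.
ASCII_LOWER = str.maketrans(string.ascii_uppercase, string.ascii_lowercase)

def get_encoding(label, extra_labels=()):
    """Return "Shift_JIS" for a known label, else None (failure)."""
    normalized = label.strip(ASCII_WHITESPACE).translate(ASCII_LOWER)
    if normalized in SHIFT_JIS_LABELS or normalized in extra_labels:
        return "Shift_JIS"
    return None

print(get_encoding("Windows-31J"))       # Shift_JIS
print(get_encoding("MS932"))             # None: the failure described above
print(get_encoding("MS932", {"ms932"}))  # Shift_JIS once the label is added
```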

According to Wikipedia pages linked from the comments on that bug, Java started recognizing the label windows-31j in JDK 1.4.1. Nowadays Java uses IANA preferred names and treats MS932 as an alias for windows-31j, which is the preferred name. The theory is that email labeled MS932 arises from MS932 having been the Java-recognized name for the Windows flavor of Shift_JIS in legacy Java.

https://wiki.whatwg.org/wiki/Web_Encodings#Encodings indicates that Presto-Opera supported MS932 as a label of Shift_JIS. (And apparently also supported the cp932 label, which the JDK doesn&apos;t know about.)

Although we don&apos;t have Web-motivated evidence justifying the introduction of ms932 as a label of Shift_JIS, adding it is probably harmless and might fix more than the one email that has been reported.</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>117315</commentid>
    <comment_count>1</comment_count>
    <who name="Anne">annevk</who>
    <bug_when>2015-01-19 13:44:39 +0000</bug_when>
    <thetext>Apparently Opera was the only browser to support this as a label. It would be good to have data before making this change, as we&apos;ve seen before that adding a label can be problematic (shift-jis is not a label for shift_jis):

  Content-Type: text/html; charset=shift-jis

  &lt;meta charset=utf-8&gt;</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>117317</commentid>
    <comment_count>2</comment_count>
    <who name="Simon Pieters">zcorpan</who>
    <bug_when>2015-01-19 15:30:52 +0000</bug_when>
    <thetext>http://webdevdata.org/ data set 2015-01-08 (780 MB), 87,000 pages.

1 page would be fixed by supporting the label, none would regress.

daimaru-matsuzakaya.jp has
Content-Type: text/html;charset=MS932
&lt;meta http-equiv=&quot;Content-Type&quot; content=&quot;text/html; charset=Shift_JIS&quot; /&gt;
OK either way

eplus.jp
Content-Type: text/html; charset=MS932
&lt;META http-equiv=&quot;Content-Type&quot; content=&quot;text/html;charset=Windows-31J&quot;&gt;
OK either way

benesse.jp
Content-Type: text/html;charset=MS932
&lt;meta http-equiv=&quot;Content-Type&quot; content=&quot;text/html; charset=Shift_JIS&quot;&gt;
OK either way

saisoncard-sindan.jp
Content-Type: text/html
&lt;meta http-equiv=&quot;Content-Type&quot; content=&quot;text/html; charset=MS932&quot;&gt;
OK in Presto, broken elsewhere

peachjohn.co.jp
Content-Type: text/html;charset=MS932
&lt;meta http-equiv=&quot;Content-Type&quot; content=&quot;text/html; charset=Shift_JIS&quot; /&gt;
OK either way</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>117320</commentid>
    <comment_count>3</comment_count>
    <who name="Simon Pieters">zcorpan</who>
    <bug_when>2015-01-19 15:48:47 +0000</bug_when>
    <thetext>httparchive https://www.igvita.com/2013/06/20/http-archive-bigquery-web-performance-answers/

SELECT page, COUNT(*) as num
FROM [httparchive:runs.2014_08_15_requests_body]
WHERE LOWER(mimeType) CONTAINS &quot;ms932&quot;
OR REGEXP_MATCH(LOWER(body), r&quot;\bms932\b&quot;)
GROUP BY page
ORDER BY num desc;

2 matches.

http://www.51sole.com/
Content-Type: text/html; charset=utf-8
&lt;meta http-equiv=&quot;Content-Type&quot; content=&quot;text/html; charset=utf-8&quot; /&gt;
False positive.

http://www.bestusedtires.com/
Content-Type: text/html; charset=UTF-8
&lt;meta http-equiv=&quot;Content-Type&quot; content=&quot;text/html; charset=utf-8&quot;/&gt;
False positive.</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>117323</commentid>
    <comment_count>4</comment_count>
    <who name="Anne">annevk</who>
    <bug_when>2015-01-19 16:14:35 +0000</bug_when>
    <thetext>(I misremembered the issue in comment 1, as shift-jis is clearly a known label. I think it was euc_jp getting recognized as euc-jp.)

Those pages that have Content-Type: text/html;charset=MS932 would actually be slightly better off, as we would know the encoding for certain and would no longer have to scan for it in the HTML.
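
As a rough sketch of that difference, in Python. It deliberately ignores BOM sniffing and the other steps of the real algorithm, and `KNOWN_LABELS`, `pick_encoding`, and the windows-1252 fallback are hypothetical choices made for illustration:

```python
# Tiny illustrative label table; real tables are far larger.
KNOWN_LABELS = {
    "shift_jis": "Shift_JIS",
    "windows-31j": "Shift_JIS",
    "utf-8": "UTF-8",
}

def pick_encoding(header_charset, meta_charset, labels, fallback):
    """Return (encoding, source): header first, then meta, then fallback."""
    for source, label in (("header", header_charset), ("meta", meta_charset)):
        if label is not None:
            encoding = labels.get(label.strip().lower())
            if encoding is not None:
                return encoding, source
    return fallback, "fallback"

# Same table with ms932 added as a label of Shift_JIS.
with_ms932 = dict(KNOWN_LABELS, ms932="Shift_JIS")

# Header says MS932, meta says Shift_JIS (like several pages in comment 2).
print(pick_encoding("MS932", "Shift_JIS", KNOWN_LABELS, "windows-1252"))
# ('Shift_JIS', 'meta'): the header is ignored and the meta has to be found
print(pick_encoding("MS932", "Shift_JIS", with_ms932, "windows-1252"))
# ('Shift_JIS', 'header'): same result, known without scanning the HTML
```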

Thanks, I guess we should add it. Anyone see any good reason not to do it?</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>117336</commentid>
    <comment_count>5</comment_count>
    <who name="Anne">annevk</who>
    <bug_when>2015-01-20 07:25:07 +0000</bug_when>
    <thetext>Shawn Steele (Microsoft) is opposed to adding this label: https://lists.w3.org/Archives/Public/www-international/2015JanMar/0012.html</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>117343</commentid>
    <comment_count>6</comment_count>
    <who name="Simon Pieters">zcorpan</who>
    <bug_when>2015-01-20 09:57:29 +0000</bug_when>
    <thetext>(In reply to Anne from comment #4)
&gt; Those pages that have Content-Type: text/html;charset=MS932 would actually
&gt; be slightly better off as we would know the encoding for certain and would no
&gt; longer have to scan for it in the HTML.

I think these are only &quot;better&quot; in theory; in practice they&apos;re equivalent. Users won&apos;t notice any difference whatsoever, and the &lt;meta&gt; will most often arrive in the same packet as the header, so there&apos;s no measurable performance impact either.

&gt; Thanks, I guess we should add it. Anyone see any good reason not to do it?

One broken page doesn&apos;t seem like a particularly convincing reason to move away from the interop (ignoring Presto) of not supporting the label. It also still seems plausible that there are other pages on the long tail with the opposite expectation.

Search for &quot;html charset ms932&quot; on github (171 matches, not analyzed).

https://github.com/search?utf8=✓&amp;q=&quot;html+charset+ms932&quot;&amp;type=Code&amp;ref=searchresults

Variants of &quot;html charset X&quot; and number of matches:

csshiftjis 2
ms_kanji 0
shift-jis/shift_jis 114,006
sjis 224
windows-31j 2,453
x-sjis 5,486
cp932 20
mscp932 0</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>117345</commentid>
    <comment_count>7</comment_count>
    <who name="Simon Pieters">zcorpan</who>
    <bug_when>2015-01-20 10:14:31 +0000</bug_when>
    <thetext>(In reply to Simon Pieters from comment #2)
&gt; saisoncard-sindan.jp
&gt; Content-Type: text/html
&gt; &lt;meta http-equiv=&quot;Content-Type&quot; content=&quot;text/html; charset=MS932&quot;&gt;
&gt; OK in Presto, broken elsewhere

Actually it is not broken in Firefox. I assume it&apos;s because Firefox makes the default encoding be Shift_JIS based on the .jp TLD.</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>122655</commentid>
    <comment_count>8</comment_count>
    <who name="Anne">annevk</who>
    <bug_when>2015-08-19 11:13:56 +0000</bug_when>
    <thetext>Thank you for reporting this. I decided to add it based on the evidence that it would make decoding legacy resources more deterministic.

https://github.com/whatwg/encoding/commit/01db1f8d98a839636af8f883fa78a461c2cfc13c</thetext>
  </long_desc>
      
      

    </bug>

</bugzilla>