<?xml version="1.0" encoding="UTF-8" standalone="yes" ?>
<!DOCTYPE bugzilla SYSTEM "https://www.w3.org/Bugs/Public/page.cgi?id=bugzilla.dtd">

<bugzilla version="5.0.4"
          urlbase="https://www.w3.org/Bugs/Public/"
          
          maintainer="sysbot+bugzilla@w3.org"
>

    <bug>
          <bug_id>16839</bug_id>
          
          <creation_ts>2012-04-24 16:03:56 +0000</creation_ts>
          <short_desc>Shift_JIS encoder is incompatible with current implementations</short_desc>
          <delta_ts>2015-01-21 20:31:35 +0000</delta_ts>
          <reporter_accessible>1</reporter_accessible>
          <cclist_accessible>1</cclist_accessible>
          <classification_id>1</classification_id>
          <classification>Unclassified</classification>
          <product>WHATWG</product>
          <component>Encoding</component>
          <version>unspecified</version>
          <rep_platform>All</rep_platform>
          <op_sys>Windows 3.1</op_sys>
          <bug_status>RESOLVED</bug_status>
          <resolution>FIXED</resolution>
          
          
          <bug_file_loc></bug_file_loc>
          <status_whiteboard></status_whiteboard>
          <keywords></keywords>
          <priority>P2</priority>
          <bug_severity>normal</bug_severity>
          <target_milestone>Unsorted</target_milestone>
          
          
          <everconfirmed>1</everconfirmed>
          <reporter name="Masatoshi Kimura">VYV03354</reporter>
          <assigned_to name="Anne">annevk</assigned_to>
          <cc>jshin</cc>
    
    <cc>mike</cc>
    
    <cc>pub-w3</cc>
          
          <qa_contact>sideshowbarker+encodingspec</qa_contact>

      

      

      

          <comment_sort_order>oldest_to_newest</comment_sort_order>  
          <long_desc isprivate="0" >
    <commentid>66980</commentid>
    <comment_count>0</comment_count>
    <who name="Masatoshi Kimura">VYV03354</who>
    <bug_when>2012-04-24 16:03:56 +0000</bug_when>
    <thetext>Shift_JIS duplicate characters have the following precedence order:
1. JIS83 characters (index 125 to 166)
2. NEC special characters (index 1128 to 1219)
3. IBM extensions (index 10716 to 11103)
4. NEC selected IBM extensions (index 8272 to 8647)
The &quot;first pointer&quot; rule fails to give higher priority to IBM extensions. Maybe index files should have a way to indicate a &quot;decode only&quot; index.
This order is implemented by virtually all browsers (at least IE, Firefox, Chrome, Safari and Opera) and it is even documented.
http://support.microsoft.com/kb/170559 (Japanese; no English KB is available)
Note that this rule applies only to the Shift_JIS encoder, because EUC-JP and ISO-2022-JP cannot access index values of 8836 or larger.</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>66990</commentid>
    <comment_count>1</comment_count>
    <who name="Anne">annevk</who>
    <bug_when>2012-04-24 20:32:52 +0000</bug_when>
    <thetext>Alternatively we could document what you list above. Look up the code point in this pointer range, then this pointer range, then this pointer range, etc. I suspect this may apply to other encoders as well, though, so maybe I should reconsider this index design.</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>66997</commentid>
    <comment_count>2</comment_count>
    <who name="Masatoshi Kimura">VYV03354</who>
    <bug_when>2012-04-24 23:09:53 +0000</bug_when>
    <thetext>Either way is fine as long as the spec does not diverge from already converged implementations.</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>66998</commentid>
    <comment_count>3</comment_count>
    <who name="Masatoshi Kimura">VYV03354</who>
    <bug_when>2012-04-24 23:10:51 +0000</bug_when>
    <thetext>1., 2. and 3. are in index order, so only IBM extensions need to be special-cased.</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>67025</commentid>
    <comment_count>4</comment_count>
    <who name="">pub-w3</who>
    <bug_when>2012-04-25 15:52:03 +0000</bug_when>
    <thetext>This is documented in Lunde as well.

(There is at least one duplicate below 8836, but the ‘first pointer’ rule probably handles that.)</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>67058</commentid>
    <comment_count>5</comment_count>
    <who name="Anne">annevk</who>
    <bug_when>2012-04-25 20:12:21 +0000</bug_when>
    <thetext>http://lists.w3.org/Archives/Public/www-archive/2012Apr/0062.html has the duplicate code points for all indexes.</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>67091</commentid>
    <comment_count>6</comment_count>
    <who name="Masatoshi Kimura">VYV03354</who>
    <bug_when>2012-04-25 23:25:27 +0000</bug_when>
    <thetext>Filed bug 16862 for gbk.</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>81388</commentid>
    <comment_count>7</comment_count>
    <who name="Anne">annevk</who>
    <bug_when>2013-01-15 10:40:06 +0000</bug_when>
    <thetext>We cannot just special case the range 10716 to 11103 as that would give the wrong result for e.g. U+2160 per comment 5.

So the solution is either to create a special index or to do the lookup per comment 0. Search in those ranges (1-3) first and then start from the beginning if nothing is found (potentially skipping those ranges (1-3) although I would not expect anyone to actually implement it like this).</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>81425</commentid>
    <comment_count>8</comment_count>
    <who name="">pub-w3</who>
    <bug_when>2013-01-15 20:04:15 +0000</bug_when>
    <thetext>‘Lookup per comment 0’ can be defined a bit more simply by saying that the search is to proceed as usual, but with indices 8,836 (94*94) and above (in practice 10,716 to 11,103) inserted before 8,272 (88*94) for Shift-JIS.  Real implementations could easily generate an inverted index based on this.

Indicating non-reversible mappings in the index seems nicer in some ways, but it may be better to keep the index format simple if possible.  (Hong Kong Supplementary Character Set extensions are also handled by the algorithm with no additional information added to the index.)</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>81428</commentid>
    <comment_count>9</comment_count>
    <who name="">pub-w3</who>
    <bug_when>2013-01-15 20:30:20 +0000</bug_when>
    <thetext>Actually, Shift-JIS encoders can just skip the range 8,272 to 8,835 (Rows 89 to 94) completely.

ISO-2022-JP and EUC-JP encoders may instead stop before 8,836, but continuing beyond Row 94 will not affect the result.</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>103730</commentid>
    <comment_count>10</comment_count>
    <who name="Anne">annevk</who>
    <bug_when>2014-04-11 10:36:56 +0000</bug_when>
    <thetext>https://github.com/whatwg/encoding/commit/03f02c0134901cb706ded37b27457abb8d42e836</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>117404</commentid>
    <comment_count>11</comment_count>
    <who name="Jungshik Shin">jshin</who>
    <bug_when>2015-01-21 20:31:35 +0000</bug_when>
    <thetext>Filed bug 27878 for Big5.</thetext>
  </long_desc>
      
      

    </bug>

</bugzilla>