This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019. Please see the home page for more details.

Bug 16839 - Shift_JIS encoder is incompatible with current implementations
Summary: Shift_JIS encoder is incompatible with current implementations
Status: RESOLVED FIXED
Alias: None
Product: WHATWG
Classification: Unclassified
Component: Encoding
Version: unspecified
Hardware: All Windows 3.1
Importance: P2 normal
Target Milestone: Unsorted
Assignee: Anne
QA Contact: sideshowbarker+encodingspec
Reported: 2012-04-24 16:03 UTC by Masatoshi Kimura
Modified: 2015-01-21 20:31 UTC

Description Masatoshi Kimura 2012-04-24 16:03:56 UTC
Shift_JIS duplicate characters have the following precedence order:
1. JIS83 characters (index 125 to 166)
2. NEC special characters (index 1128 to 1219)
3. IBM extensions (index 10716 to 11103)
4. NEC selected IBM extensions (index 8272 to 8647)
The "first pointer" rule fails to give the IBM extensions priority over the NEC selected IBM extensions. Maybe index files should have a way to mark entries as "decode only".
This order is implemented by virtually all browsers (at least IE, Firefox, Chrome, Safari and Opera) and it is even documented.
http://support.microsoft.com/kb/170559 (Japanese; no English KB is available)
Note that this rule applies only to the Shift_JIS encoder, because EUC and ISO-2022-JP cannot access index values 8836 or larger.
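A minimal sketch of the mismatch described above, in Python. The `TOY_INDEX` entries and the `first_pointer` helper are illustrative assumptions: they are not an excerpt of the real index file, and they only place two duplicated code points inside the ranges listed.

```python
# Toy pointer -> code point table. The pointer values are assumptions
# chosen to fall inside the ranges from the description.
TOY_INDEX = {
    1148: 0x2160,   # NEC special characters (1128 to 1219)
    8272: 0x7E8A,   # NEC selected IBM extensions (8272 to 8647)
    10726: 0x2160,  # IBM extensions (10716 to 11103)
    10744: 0x7E8A,  # IBM extensions (10716 to 11103)
}


def first_pointer(index, code_point):
    """Return the lowest pointer that maps to code_point, or None."""
    matches = [p for p, cp in index.items() if cp == code_point]
    return min(matches) if matches else None


if __name__ == "__main__":
    # The "first pointer" rule picks the NEC selected copy (8272),
    # although the precedence list prefers the IBM extensions copy.
    print(first_pointer(TOY_INDEX, 0x7E8A))
```

With this toy table the helper returns 8272 for U+7E8A, which is the NEC selected pointer rather than the IBM extensions pointer that item 3 of the list ranks higher.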
Comment 1 Anne 2012-04-24 20:32:52 UTC
Alternatively, we could document what you list above: look up the code point in this pointer range, then this pointer range, then this pointer range, and so on. I suspect this may apply to other encoders as well, though, so maybe I should reconsider this index design.
Comment 2 Masatoshi Kimura 2012-04-24 23:09:53 UTC
Either way is fine as long as the spec does not diverge from already converged implementations.
Comment 3 Masatoshi Kimura 2012-04-24 23:10:51 UTC
Items 1, 2, and 3 are in index order, so only the IBM extensions need to be special-cased.
Comment 4 pub-w3 2012-04-25 15:52:03 UTC
This is documented in Lunde as well.

(There is at least one duplicate below 8836, but the ‘first pointer’ rule probably handles that.)
Comment 5 Anne 2012-04-25 20:12:21 UTC
http://lists.w3.org/Archives/Public/www-archive/2012Apr/0062.html has the duplicate code points for all indexes.
Comment 6 Masatoshi Kimura 2012-04-25 23:25:27 UTC
Filed bug 16862 for gbk.
Comment 7 Anne 2013-01-15 10:40:06 UTC
We cannot just special-case the range 10716 to 11103, as that would give the wrong result for, e.g., U+2160 per comment 5.

So the solution is either to create a special index or to do the lookup per comment 0. Search in those ranges (1-3) first and then start from the beginning if nothing is found (potentially skipping those ranges (1-3) although I would not expect anyone to actually implement it like this).
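A rough Python sketch contrasting the two approaches in this comment. The table entries and both function names are hypothetical and only illustrate the reasoning; the range bounds come from comment 0.

```python
# Illustrative pointer -> code point entries (assumed, not real index data).
TOY_INDEX = {
    1148: 0x2160,   # NEC special characters
    8272: 0x7E8A,   # NEC selected IBM extensions
    10726: 0x2160,  # IBM extensions
    10744: 0x7E8A,  # IBM extensions
}

# Ranges 1 to 3 from comment 0, in precedence order (inclusive bounds).
PREFERRED_RANGES = [(125, 166), (1128, 1219), (10716, 11103)]


def ibm_only_special_case(index, code_point):
    """Prefer any pointer in 10716..11103, else fall back to the first pointer."""
    matches = sorted(p for p, cp in index.items() if cp == code_point)
    for pointer in matches:
        if 10716 <= pointer <= 11103:
            return pointer
    return matches[0] if matches else None


def lookup_per_comment_0(index, code_point):
    """Search ranges 1 to 3 in order, then fall back to the first pointer."""
    matches = sorted(p for p, cp in index.items() if cp == code_point)
    for low, high in PREFERRED_RANGES:
        for pointer in matches:
            if low <= pointer <= high:
                return pointer
    return matches[0] if matches else None
```

In this toy table, `ibm_only_special_case` returns the IBM extensions pointer for U+2160, while `lookup_per_comment_0` returns the NEC special pointer, which is the higher-priority one in comment 0.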
Comment 8 pub-w3 2013-01-15 20:04:15 UTC
‘Lookup per comment 0’ can be defined a bit more simply by saying that the search is to proceed as usual, but with indices 8,836 (94*94) and above (in practice 10,716 to 11,103) inserted before 8,272 (88*94) for Shift-JIS.  Real implementations could easily generate an inverted index based on this.
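One way the reordered search and the inverted index could look, as a sketch in Python. The helper names are made up for illustration, and `index` is assumed to be a plain dict from pointer to code point.

```python
ROW_89_START = 88 * 94  # 8272
ROW_95_START = 94 * 94  # 8836


def shift_jis_search_order(pointers):
    """Order pointers so that 8836 and above are visited before 8272..8835."""
    ordered = sorted(pointers)
    low = [p for p in ordered if p < ROW_89_START]
    rows_89_to_94 = [p for p in ordered if ROW_89_START <= p < ROW_95_START]
    high = [p for p in ordered if p >= ROW_95_START]
    return low + high + rows_89_to_94


def build_inverted_index(index):
    """Map each code point to the first pointer found in the reordered search."""
    inverted = {}
    for pointer in shift_jis_search_order(index):
        # setdefault keeps the first pointer seen for a code point.
        inverted.setdefault(index[pointer], pointer)
    return inverted
```

Building the inverted table once keeps the per-character encode step a single dictionary lookup, which is presumably what "real implementations" would want.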

Indicating non-reversible mappings in the index seems nicer in some ways, but it may be better to keep the index format simple if possible.  (Hong Kong Supplementary Character Set extensions are also handled by the algorithm with no additional information added to the index.)
Comment 9 pub-w3 2013-01-15 20:30:20 UTC
Actually, Shift-JIS encoders can just skip the range 8,272 to 8,835 (Rows 89 to 94) completely.
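The skip-the-range variant is even shorter. A hedged Python sketch, assuming `index` is a dict from pointer to code point and using a hypothetical function name:

```python
def build_shift_jis_encoder_index(index):
    """Invert the index, ignoring pointers 8272..8835 (Rows 89 to 94)."""
    inverted = {}
    for pointer in sorted(index):
        if 8272 <= pointer <= 8835:
            continue  # treated as decode-only by the Shift-JIS encoder
        # Among the remaining pointers, the lowest one wins.
        inverted.setdefault(index[pointer], pointer)
    return inverted
```

This relies on the claim in the comment that the skipped rows can be left out completely; a code point that existed only in that range would become unencodable under this sketch.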

ISO-2022-JP and EUC-JP encoders may instead stop before 8,836, but continuing beyond Row 94 will not affect the result.
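The Row 94 limit follows from the byte arithmetic. A small Python sketch, assuming the usual 94 by 94 EUC layout with both bytes in 0xA1 to 0xFE; the function name is hypothetical:

```python
def euc_jp_bytes(pointer):
    """Convert a non-negative pointer to a two-byte EUC-JP sequence."""
    lead = pointer // 94 + 0xA1
    trail = pointer % 94 + 0xA1
    if not 0xA1 <= lead <= 0xFE:
        # Pointers 8836 (94*94) and above would need a lead byte past 0xFE.
        raise ValueError(f"pointer {pointer} is outside Rows 1 to 94")
    return bytes([lead, trail])
```

Pointer 8835 is the last cell of Row 94 and maps to 0xFE 0xFE; pointer 8836 would need a lead byte of 0xFF, so a matching entry there can never be emitted and scanning past it cannot change the result.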
Comment 11 Jungshik Shin 2015-01-21 20:31:35 UTC
Filed bug 27878 for Big5