16839 – Shift_JIS encoder is incompatible with current implementations

This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019.

Status:	RESOLVED FIXED

Alias:	None

Product:	WHATWG
Classification:	Unclassified
Component:	Encoding
Version:	unspecified
Hardware:	All Windows 3.1

Importance:	P2 normal
Target Milestone:	Unsorted
Assignee:	Anne
QA Contact:	sideshowbarker+encodingspec

Reported:	2012-04-24 16:03 UTC by Masatoshi Kimura
Modified:	2015-01-21 20:31 UTC
CC List:	3 users

Description Masatoshi Kimura 2012-04-24 16:03:56 UTC

Shift_JIS duplicate characters have the following precedence order:
1. JIS83 characters (index 125 to 166)
2. NEC special characters (index 1128 to 1219)
3. IBM extensions (index 10716 to 11103)
4. NEC selected IBM extensions (index 8272 to 8647)
The "first pointer" rule fails to give IBM extensions higher priority than NEC selected IBM extensions. Maybe index files should have a way to indicate a "decode only" index.
This order is implemented by virtually all browsers (at least IE, Firefox, Chrome, Safari and Opera) and it is even documented.
http://support.microsoft.com/kb/170559 (Japanese; no English KB is available)
Note that this rule applies only to the Shift_JIS encoder, because EUC and ISO-2022-JP cannot access index values of 8836 or larger.
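
For illustration, the precedence order in this description can be sketched in Python as below. This is only a sketch under stated assumptions: `TOY_INDEX`, `first_pointer` and `precedence_pointer` are hypothetical names, and the sample entries are placeholders chosen to fall inside the ranges listed above, so they should be checked against the published index before being relied on.

```python
# Toy stand-in for the jis0208 index: (pointer, code point) pairs in pointer order.
# Placeholder entries for illustration only, not an extract of the real index.
TOY_INDEX = [
    (1148, 0x2160),   # falls in the NEC special characters range
    (8272, 0x7E8A),   # falls in the NEC selected IBM extensions range
    (10726, 0x2160),  # falls in the IBM extensions range
    (10744, 0x7E8A),  # falls in the IBM extensions range
]

# Precedence order from the description, highest priority first.
PRECEDENCE = [
    range(125, 167),      # JIS83 characters (125 to 166)
    range(1128, 1220),    # NEC special characters (1128 to 1219)
    range(10716, 11104),  # IBM extensions (10716 to 11103)
    range(8272, 8648),    # NEC selected IBM extensions (8272 to 8647)
]


def first_pointer(code_point, index=TOY_INDEX):
    """Plain "first pointer" rule: return the lowest pointer for code_point."""
    for pointer, cp in index:
        if cp == code_point:
            return pointer
    return None


def precedence_pointer(code_point, index=TOY_INDEX):
    """Search the listed ranges in precedence order, then fall back to first pointer."""
    for span in PRECEDENCE:
        for pointer, cp in index:
            if pointer in span and cp == code_point:
                return pointer
    return first_pointer(code_point, index)
```

With this toy data, the plain "first pointer" rule picks the NEC selected IBM extensions pointer for the second character, while the precedence-based lookup picks the IBM extensions pointer.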

Comment 1 Anne 2012-04-24 20:32:52 UTC

Alternatively, we could document what you list above: look up the code point in this pointer range, then in this pointer range, then in this pointer range, etc. I suspect this may apply to other encoders as well, though, so maybe I should reconsider this index design.

Comment 2 Masatoshi Kimura 2012-04-24 23:09:53 UTC

Either way is fine as long as the spec does not diverge from already converged implementations.

Comment 3 Masatoshi Kimura 2012-04-24 23:10:51 UTC

Items 1, 2 and 3 are in index order, so only the IBM extensions need to be special-cased.

Comment 4 pub-w3 2012-04-25 15:52:03 UTC

This is documented in Lunde as well.

(There is at least one duplicate below 8836, but the ‘first pointer’ rule probably handles that.)

Comment 5 Anne 2012-04-25 20:12:21 UTC

http://lists.w3.org/Archives/Public/www-archive/2012Apr/0062.html has the duplicate code points for all indexes.

Comment 6 Masatoshi Kimura 2012-04-25 23:25:27 UTC

Filed bug 16862 for gbk.

Comment 7 Anne 2013-01-15 10:40:06 UTC

We cannot just special case the range 10716 to 11103 as that would give the wrong result for e.g. U+2160 per comment 5.

So the solution is either to create a special index or to do the lookup per comment 0. Search in those ranges (1-3) first and then start from the beginning if nothing is found (potentially skipping those ranges (1-3) although I would not expect anyone to actually implement it like this).

Comment 8 pub-w3 2013-01-15 20:04:15 UTC

‘Lookup per comment 0’ can be defined a bit more simply by saying that the search is to proceed as usual, but with indices 8,836 (94*94) and above (in practice 10,716 to 11,103) inserted before 8,272 (88*94) for Shift-JIS.  Real implementations could easily generate an inverted index based on this.

Indicating non-reversible mappings in the index seems nicer in some ways, but it may be better to keep the index format simple if possible.  (Hong Kong Supplementary Character Set extensions are also handled by the algorithm with no additional information added to the index.)
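
As a rough illustration of the reordered search proposed in this comment, an inverted index could be generated along the following lines. This is a sketch, not the spec's wording: `build_inverted_index` and `TOY_INDEX` are hypothetical names, and the sample entries are placeholders rather than an extract of the real index.

```python
# Toy stand-in for the jis0208 index: (pointer, code point) pairs in pointer order.
# Placeholder entries for illustration only.
TOY_INDEX = [
    (1148, 0x2160),
    (8272, 0x7E8A),
    (10726, 0x2160),
    (10744, 0x7E8A),
]


def build_inverted_index(index):
    """Map each code point to its preferred Shift_JIS pointer.

    Entries are visited in pointer order, except that pointers 8836 and
    above are moved in front of the block starting at 8272.
    """
    low = [entry for entry in index if entry[0] < 8272]
    nec_selected = [entry for entry in index if 8272 <= entry[0] < 8836]
    high = [entry for entry in index if entry[0] >= 8836]

    inverted = {}
    for pointer, code_point in low + high + nec_selected:
        # setdefault keeps the first pointer seen in the reordered walk.
        inverted.setdefault(code_point, pointer)
    return inverted
```

An encoder could then do a single dictionary lookup per code point instead of scanning the index.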

Comment 9 pub-w3 2013-01-15 20:30:20 UTC

Actually, Shift-JIS encoders can just skip the range 8,272 to 8,835 (Rows 89 to 94) completely.

ISO-2022-JP and EUC-JP encoders may instead stop before 8,836, but continuing beyond Row 94 will not affect the result.
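
The simplification in this comment might look like the following sketch. The function names and `TOY_INDEX` are hypothetical, and the sample entries are placeholders rather than real index data.

```python
# Toy stand-in for the jis0208 index: (pointer, code point) pairs in pointer order.
# Placeholder entries for illustration only.
TOY_INDEX = [
    (1148, 0x2160),
    (8272, 0x7E8A),
    (10726, 0x2160),
    (10744, 0x7E8A),
]


def shift_jis_pointer(code_point, index=TOY_INDEX):
    """First-pointer lookup that ignores pointers 8272 to 8835 inclusive."""
    for pointer, cp in index:
        if 8272 <= pointer <= 8835:
            continue  # skip Rows 89 to 94 entirely
        if cp == code_point:
            return pointer
    return None


def jis0208_pointer(code_point, index=TOY_INDEX):
    """First-pointer lookup for ISO-2022-JP and EUC-JP, stopping before 8836."""
    for pointer, cp in index:
        if pointer >= 8836:
            break  # optional early exit; index is in pointer order
        if cp == code_point:
            return pointer
    return None
```

Compared with reordering the index, skipping the range keeps the encoder a plain first-match scan with one extra condition.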

Comment 10 Anne 2014-04-11 10:36:56 UTC

https://github.com/whatwg/encoding/commit/03f02c0134901cb706ded37b27457abb8d42e836

Comment 11 Jungshik Shin 2015-01-21 20:31:35 UTC

Filed bug 27878 for Big5