This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019. Please see the home page for more details.
Shift_JIS duplicate characters have the following precedence order:
1. JIS83 characters (index 125 to 166)
2. NEC special characters (index 1128 to 1219)
3. IBM extensions (index 10716 to 11103)
4. NEC selected IBM extensions (index 8272 to 8647)
The "first pointer" rule fails to give IBM extensions priority over the NEC selected IBM extensions. Maybe index files should have a way to mark entries as "decode only". This order is implemented by virtually all browsers (at least IE, Firefox, Chrome, Safari and Opera) and it is even documented: http://support.microsoft.com/kb/170559 (Japanese; no English KB is available). Note that this rule applies only to the Shift_JIS encoder, because EUC-JP and ISO-2022-JP cannot access index values 8836 or larger.
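As a rough illustration of the precedence order listed above, here is a minimal Python sketch. It assumes the index is available as a dict from pointer to code point; the function name `preferred_pointer` and the toy index (private-use code points at made-up pointers) are hypothetical and are not taken from the real jis0208 index.

```python
# Pointer ranges in the precedence order given above (inclusive bounds).
PRECEDENCE = [
    (125, 166),      # 1. JIS83 characters
    (1128, 1219),    # 2. NEC special characters
    (10716, 11103),  # 3. IBM extensions
    (8272, 8647),    # 4. NEC selected IBM extensions
]


def preferred_pointer(index, code_point):
    """Return the pointer an encoder should use for code_point, or None.

    index is assumed to map pointer -> code point (hypothetical layout).
    """
    candidates = sorted(p for p, cp in index.items() if cp == code_point)
    if not candidates:
        return None
    # Try each precedence range in turn.
    for low, high in PRECEDENCE:
        for pointer in candidates:
            if low <= pointer <= high:
                return pointer
    # No candidate falls in a listed range: fall back to the first pointer.
    return candidates[0]


# Toy index with made-up entries, for illustration only.
TOY_INDEX = {
    130: 0xE000, 8300: 0xE000, 10720: 0xE000,  # duplicated in three ranges
    1150: 0xE001, 10730: 0xE001,               # NEC special vs. IBM
    8280: 0xE002, 10800: 0xE002,               # NEC selected IBM vs. IBM
    500: 0xE003,                               # no duplicate
}

print(preferred_pointer(TOY_INDEX, 0xE002))
```

In the toy data the last duplicate pair is the interesting one: a plain "first pointer" search would pick 8280, while the precedence order picks the IBM extension pointer 10800.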
Alternatively we could document what you list above: look up the code point in this pointer range, then this pointer range, then this pointer range, etc. I suspect this may apply to other encoders as well, though, so maybe I should reconsider this index design.
Either way is fine as long as the spec does not diverge from already converged implementations.
1., 2. and 3. are in index order, so only IBM extensions need to be special-cased.
This is documented in Lunde as well. (There is at least one duplicate below 8836, but the ‘first pointer’ rule probably handles that.)
http://lists.w3.org/Archives/Public/www-archive/2012Apr/0062.html has the duplicate code points for all indexes.
Filed bug 16862 for gbk.
We cannot just special case the range 10716 to 11103 as that would give the wrong result for e.g. U+2160 per comment 5. So the solution is either to create a special index or to do the lookup per comment 0. Search in those ranges (1-3) first and then start from the beginning if nothing is found (potentially skipping those ranges (1-3) although I would not expect anyone to actually implement it like this).
‘Lookup per comment 0’ can be defined a bit more simply by saying that the search is to proceed as usual, but with indices 8,836 (94*94) and above (in practice 10,716 to 11,103) inserted before 8,272 (88*94) for Shift-JIS. Real implementations could easily generate an inverted index based on this. Indicating non-reversible mappings in the index seems nicer in some ways, but it may be better to keep the index format simple if possible. (Hong Kong Supplementary Character Set extensions are also handled by the algorithm with no additional information added to the index.)
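A possible way to generate such an inverted index, sketched in Python. This assumes the index is a dict from pointer to code point; `build_shift_jis_encode_map` and the toy entries are hypothetical names and data, not part of any specification or of the real index.

```python
def build_shift_jis_encode_map(index, split=8272, limit=8836):
    """Build a code point -> pointer map for a Shift_JIS encoder.

    Pointers are visited in the order [0, split), [limit, end), [split, limit),
    i.e. pointers at or above `limit` are inserted before `split`.
    The first pointer seen for a code point wins.
    """
    def visit_order(pointer):
        if pointer < split:
            group = 0
        elif pointer >= limit:
            group = 1
        else:
            group = 2
        return (group, pointer)

    encode_map = {}
    for pointer in sorted(index, key=visit_order):
        # setdefault keeps the earliest pointer in the modified order.
        encode_map.setdefault(index[pointer], pointer)
    return encode_map


# Toy index with made-up entries (private-use code points), illustration only.
TOY_INDEX = {
    130: 0xE000, 8300: 0xE000, 10720: 0xE000,
    8280: 0xE002, 10800: 0xE002,
    8400: 0xE004,  # present only in the range that is moved to the end
}

ENCODE_MAP = build_shift_jis_encode_map(TOY_INDEX)
print(ENCODE_MAP[0xE002])
```

Because the range starting at 8,272 is only moved to the end rather than dropped, a code point that exists nowhere else (0xE004 in the toy data) still gets a pointer.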
Actually, Shift-JIS encoders can just skip the range 8,272 to 8,835 (Rows 89 to 94) completely. ISO-2022-JP and EUC-JP encoders may instead stop before 8,836, but continuing beyond Row 94 will not affect the result.
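The skip-based variant might look like the following Python sketch. The helper names and the toy index are hypothetical, and the sketch assumes (as the comment above implies) that nothing in the skipped range is needed by the Shift_JIS encoder.

```python
SKIPPED_FOR_SHIFT_JIS = range(8272, 8836)  # Rows 89 to 94
EUC_LIMIT = 8836                           # 94 * 94


def shift_jis_pointer(index, code_point):
    """First pointer for code_point, ignoring pointers 8272..8835."""
    for pointer in sorted(index):
        if pointer in SKIPPED_FOR_SHIFT_JIS:
            continue
        if index[pointer] == code_point:
            return pointer
    return None


def euc_jp_pointer(index, code_point):
    """First pointer for code_point, stopping before pointer 8836."""
    for pointer in sorted(index):
        if pointer >= EUC_LIMIT:
            break
        if index[pointer] == code_point:
            return pointer
    return None


# Toy index with made-up entries (private-use code points), illustration only.
TOY_INDEX = {
    130: 0xE000, 10720: 0xE000,
    8280: 0xE002, 10800: 0xE002,
}

print(shift_jis_pointer(TOY_INDEX, 0xE002), euc_jp_pointer(TOY_INDEX, 0xE002))
```

With the toy data, the Shift_JIS lookup lands on the pointer above 8,836 for the duplicated entry, while the EUC-JP lookup stays within the first 94 rows.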
https://github.com/whatwg/encoding/commit/03f02c0134901cb706ded37b27457abb8d42e836
Filed bug 27878 for Big5