This is an archived snapshot of W3C's public Bugzilla bug tracker, decommissioned in April 2019.

Bug 21145 - Index gb18030 pointer algorithm doesn't do enough
Status: RESOLVED DUPLICATE of bug 16862
Alias: None
Product: WHATWG
Classification: Unclassified
Component: Encoding
Version: unspecified
Hardware: PC Windows NT
Importance: P2 normal
Target Milestone: Unsorted
Assignee: Anne
QA Contact: sideshowbarker+encodingspec
Reported: 2013-02-27 12:39 UTC by Peter Occil
Modified: 2013-12-16 16:11 UTC

Description Peter Occil 2013-02-27 12:39:49 UTC
http://encoding.spec.whatwg.org/#indexes

[[
The index gb18030 pointer for code point is the return value of these steps:

Let offset be the last code point in index gb18030 that is equal to or less than code point and let pointer offset be its corresponding pointer.

Return a pointer whose value is pointer offset + code point − offset.
]]

I'm afraid these steps are not enough, because distinct code points can return the same pointer. For instance, the code points 0xE7C9 and 0xE7E7 both return 33471. It seems to me that, in reality, 0xE7C9 should not be a valid code point in GB18030.
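The collision can be reproduced with a minimal sketch of the quoted steps. The four (pointer, code point) pairs below are an excerpt assumed to match the published ranges index; only the result 33471 for both code points is stated in this report. The function name `ranges_pointer_as_specified` is made up for illustration.

```python
from bisect import bisect_right

# Excerpt of the ranges index as (pointer, code point) pairs.
# Assumption: these four entries match the published index file.
RANGES_EXCERPT = [
    (33469, 0xE76C),
    (33470, 0xE7C8),
    (33471, 0xE7E7),
    (33484, 0xE815),
]


def ranges_pointer_as_specified(code_point, ranges=RANGES_EXCERPT):
    """Follow the two quoted steps literally, with no range-length check."""
    code_points = [cp for _, cp in ranges]
    # Index of the last entry whose code point is <= code_point.
    i = bisect_right(code_points, code_point) - 1
    if i < 0:
        raise LookupError("code point lies below the excerpt")
    pointer_offset, offset = ranges[i]
    return pointer_offset + code_point - offset


# 0xE7C9 falls one past the single-entry range that starts at 0xE7C8,
# so it lands on the pointer that belongs to 0xE7E7.
print(ranges_pointer_as_specified(0xE7C9))
print(ranges_pointer_as_specified(0xE7E7))
```

Both calls yield the same pointer because nothing in the quoted steps limits how far past an entry's start a code point may be.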

I suggest the following change:

-----
The index gb18030 pointer for _code point_ is the return value of these steps:

If _code point_ is less than 0x80 or greater than 0x10FFFF, return null.

If _code point_ is greater than or equal to 0x10000, return a _pointer_ whose value is 189000 + _code point_ - 0x10000.

Let _offset_ be the last code point in index gb18030 that is equal to or less than _code point_ and let _pointer offset_ be its corresponding pointer.

Let _next pointer offset_ be the pointer of the entry in index gb18030 that comes immediately after the entry for _pointer offset_.

If _code point_ minus _offset_ is greater than or equal to _next pointer offset_ minus _pointer offset_, return null.

Return a _pointer_ whose value is _pointer offset_ + _code point_ − _offset_.

-----
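The suggested steps can be sketched as follows. This is an illustration of the proposal only, not the specification's algorithm; the excerpt entries are assumed to match the published ranges index, and `proposed_ranges_pointer` is a hypothetical name. Because the sketch uses an excerpt rather than the full index, it raises `LookupError` for BMP code points the excerpt cannot answer.

```python
from bisect import bisect_right

# Excerpt of the ranges index as (pointer, code point) pairs.
# Assumption: these four entries match the published index file.
RANGES_EXCERPT = [
    (33469, 0xE76C),
    (33470, 0xE7C8),
    (33471, 0xE7E7),
    (33484, 0xE815),
]


def proposed_ranges_pointer(code_point, ranges=RANGES_EXCERPT):
    """Sketch of the suggested steps; None stands in for null."""
    if code_point < 0x80 or code_point > 0x10FFFF:
        return None
    if code_point >= 0x10000:
        return 189000 + code_point - 0x10000
    code_points = [cp for _, cp in ranges]
    # Index of the last entry whose code point is <= code_point.
    i = bisect_right(code_points, code_point) - 1
    if i < 0 or i + 1 >= len(ranges):
        # Limitation of the excerpt, not part of the proposal.
        raise LookupError("code point is outside the excerpt")
    pointer_offset, offset = ranges[i]
    next_pointer_offset = ranges[i + 1][0]
    # Reject code points that run past the end of the matched range.
    if code_point - offset >= next_pointer_offset - pointer_offset:
        return None
    return pointer_offset + code_point - offset


print(proposed_ranges_pointer(0xE7C9))  # past the one-entry range at 0xE7C8
print(proposed_ranges_pointer(0xE7E7))  # start of its own range
```

With the extra length check, 0xE7C9 no longer shares a pointer with 0xE7E7; it yields no pointer at all.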
Comment 1 Anne 2013-03-03 13:39:48 UTC
That seems kind of weird as gb18030 can supposedly encode all code points.
Comment 2 Anne 2013-12-12 18:19:46 UTC
There is a bug here though. Ugh.
Comment 3 Anne 2013-12-13 15:15:49 UTC
So the problem here is that we are not using the correct table for the two-byte sequences as far as I can tell.

0xE7C9 should map to a two-byte sequence, but since it is a PUA code point we did not include it in that table, which causes problems.
Comment 4 Anne 2013-12-16 16:11:16 UTC

*** This bug has been marked as a duplicate of bug 16862 ***