This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019. Please see the home page for more details.

Bug 27878 - Big5 : handling of U+5341(and potentially other dupe points) is incompatible with Firefox, Chrome and IE 11
Status: RESOLVED FIXED
Alias: None
Product: WHATWG
Classification: Unclassified
Component: Encoding
Version: unspecified
Hardware: PC Linux
Importance: P2 normal
Target Milestone: Unsorted
Assignee: Anne
QA Contact: sideshowbarker+encodingspec
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2015-01-21 20:30 UTC by Jungshik Shin
Modified: 2015-08-19 14:19 UTC

Description Jungshik Shin 2015-01-21 20:30:02 UTC
Spun off from bug 16389 

Duplicate entries in index-*.txt are discussed at http://lists.w3.org/Archives/Public/www-archive/2012Apr/0062.html


https://encoding.spec.whatwg.org/#index-pointer has the following:



The index pointer for code point in index is the first pointer corresponding to code point in index, or null if code point is not in index.

And, the big5 encoder has the following steps:

3. Let pointer be the index pointer for code point in index big5.

4. If pointer is null, return error with code point.

....




Using the first pointer for round-trip while using others for decoding-only (toUnicode) seems to lead to at least one discrepancy from Firefox 35, Chrome and IE 11 in Big5. 

index-big5.txt has two entries for U+5341 as shown below: 

  5287   0x5341  十 (<CJK Ideograph>)
  5512   0x5341  十 (<CJK Ideograph>)

5287 corresponds to {0xA2 0xCC} and 5512 is {0xA4 0x51}. 

All three browsers above encode U+5341 to {0xA4 0x51} in Big5 instead of {0xA2 0xCC}.
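
The pointer-to-byte mapping quoted above can be checked with a short sketch. The lead computation is the one the encoder uses (pointer / 157 + 0x81); the trail and offset steps are not quoted in this bug and are written here from a reading of the Encoding Standard's big5 encoder, so treat them as an assumption. The function name is hypothetical.

```python
def big5_pointer_to_bytes(pointer: int) -> bytes:
    """Map an index-big5 pointer to its two-byte sequence (sketch)."""
    lead = pointer // 157 + 0x81
    trail = pointer % 157
    # Assumption: trail bytes skip the gap 0x7F..0xA0, hence two offsets.
    offset = 0x40 if trail < 0x3F else 0x62
    return bytes([lead, trail + offset])


# The two U+5341 entries from index-big5.txt cited in this report.
for pointer in (5287, 5512):
    print(pointer, "->", big5_pointer_to_bytes(pointer).hex(" ").upper())
# 5287 -> A2 CC
# 5512 -> A4 51
```

With the "first pointer" rule, the encoder picks 5287 and therefore emits {0xA2 0xCC}, while the browsers' output corresponds to pointer 5512.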
Comment 1 Jungshik Shin 2015-01-21 20:31:11 UTC
Oops. I meant bug 16839
Comment 2 Jungshik Shin 2015-01-21 20:41:17 UTC
I meant to copy'n'paste these steps for Big5 encoding, which make most of the HKSCS extension decoding-only (toUnicode).

5. Let lead be pointer / 157 + 0x81.

6. If lead is less than 0xA1, return error with code point.

  * Avoid returning Hong Kong Supplementary Character Set extensions literally.
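
A minimal sketch of steps 5 and 6, to show which pointers the lead-byte check excludes from encoding. The function name and the use of None to stand for "return error" are assumptions made for illustration.

```python
from typing import Optional


def big5_encoder_lead(pointer: int) -> Optional[int]:
    """Return the lead byte for a pointer, or None where the spec returns error."""
    lead = pointer // 157 + 0x81          # step 5
    if lead < 0xA1:                       # step 6: HKSCS rows are decode-only
        return None
    return lead


# Every pointer below (0xA1 - 0x81) * 157 has a lead byte under 0xA1.
first_encodable_pointer = (0xA1 - 0x81) * 157
print(first_encodable_pointer)                      # 5024
print(big5_encoder_lead(first_encodable_pointer - 1))  # None
print(hex(big5_encoder_lead(first_encodable_pointer)))  # 0xa1
```

Both U+5341 pointers (5287 and 5512) are above this threshold, so the lead-byte check does not decide between them; only the choice of index pointer does.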

Anyway, I found another problematic character: 

U+5345 has the same problem (at least, it's incompatible with Chrome's ICU - windows-950-2000.ucm).
Comment 3 Jungshik Shin 2015-01-21 20:49:10 UTC
Both Firefox 35 and Chrome encode U+5345 as {0xA4, 0xCA} instead of {0xA2, 0xCE}.
Comment 4 Jungshik Shin 2015-01-21 20:51:04 UTC
The entries in question for U+5345 : 

 5289	0x5345	卅 (<CJK Ideograph>)  => 0xA2 0xCE
 5599	0x5345	卅 (<CJK Ideograph>)  => 0xA4 0xCA
Comment 5 Philip Jägenstedt 2015-01-21 22:44:31 UTC
(In reply to Jungshik Shin from comment #0)
> Spun off from bug 16389 

That bug looks unrelated, did you paste the wrong one?
Comment 6 Jungshik Shin 2015-01-21 22:47:30 UTC
(In reply to Philip Jägenstedt from comment #5)
> (In reply to Jungshik Shin from comment #0)
> > Spun off from bug 16389 
> 
> That bug looks unrelated, did you paste the wrong one?

See comment 1 :-). It's bug 16839
Comment 7 Philip Jägenstedt 2015-01-21 22:56:31 UTC
Have you tested all the index entries which have duplicate Unicode code points? I currently count (grep -F '(' | awk '{print $2}' | sort | uniq -c | grep -vw 1 | wc -l) 100 such cases in https://encoding.spec.whatwg.org/index-big5.txt

If there are only a handful of cases where the order needs to be reversed, perhaps special-casing those in the encoder would be the simplest.
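
The shell pipeline above counts code points that appear more than once. A rough Python equivalent, which also keeps the pointers so each pair can be inspected, might look like the sketch below. It assumes the index file layout shown in this bug (pointer, code point, then the character and its name) and that comment lines start with "#"; the function name is hypothetical, and the sample input is only the four entries quoted in this thread, not the full index.

```python
from collections import defaultdict


def duplicate_code_points(index_text: str) -> dict:
    """Map each code point listed more than once to its pointers, in file order."""
    pointers = defaultdict(list)
    for line in index_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split()
        pointers[int(fields[1], 16)].append(int(fields[0]))
    return {cp: ptrs for cp, ptrs in pointers.items() if len(ptrs) > 1}


sample = """\
5287\t0x5341\t十 (<CJK Ideograph>)
5289\t0x5345\t卅 (<CJK Ideograph>)
5512\t0x5341\t十 (<CJK Ideograph>)
5599\t0x5345\t卅 (<CJK Ideograph>)
"""
for cp, ptrs in duplicate_code_points(sample).items():
    print(f"U+{cp:04X}", ptrs)
# U+5341 [5287, 5512]
# U+5345 [5289, 5599]
```

Running it over the full index-big5.txt would give the list of candidates to compare against browser output.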
Comment 8 Jungshik Shin 2015-01-21 23:44:59 UTC
(In reply to Philip Jägenstedt from comment #7)
> Have you tested all the index entries which have duplicate Unicode points. I
> currently count (grep -F '(' | awk '{print $2}' | sort | uniq -c | grep -vw
> 1 | wc -l) 100 such cases in https://encoding.spec.whatwg.org/index-big5.txt
> 
> If there are only a handful of cases where the order needs to be reversed,
> perhaps special-casing those in the encoder would be the simplest.

I skimmed over all of them and I found no other pairs.

I also looked for all the decode-only entries in windows-950-2000.ucm (ICU). There are only 10 of them including U+5341 and U+5345. 

The following additional characters are incompatible with the encoding spec's big5. (Firefox 35 does the same). 

U+2550
U+255E
U+2561
U+256A

They're all box-drawing characters and placed in row 0xF9 (for round-trip) while 0xA2 positions are for decoding only. 

Other box-drawing characters are placed in row 0xA2 in Big5 for round-trip while 0xF9 positions are for decoding only. 

I don't know if there's any logic behind this difference between the two groups.
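
The special-casing suggested in comment 7 could be sketched as follows: keep the first pointer by default, but use the last pointer for the six code points named in this thread. This is an illustration under those assumptions, not the spec's wording; the names are hypothetical, and the demo data covers only the U+5341 and U+5345 entries quoted above, since the box-drawing pointers are not listed in this bug.

```python
# Code points reported in this thread as encoding to the later pointer in browsers.
PREFER_LAST = frozenset({0x2550, 0x255E, 0x2561, 0x256A, 0x5341, 0x5345})


def build_encode_map(entries, prefer_last=PREFER_LAST):
    """Build code point -> pointer from (pointer, code_point) pairs.

    Entries must be in ascending pointer order, as in index-big5.txt.
    """
    encode_map = {}
    for pointer, code_point in entries:
        # First pointer wins, unless the code point is special-cased.
        if code_point not in encode_map or code_point in prefer_last:
            encode_map[code_point] = pointer
    return encode_map


entries = [(5287, 0x5341), (5289, 0x5345), (5512, 0x5341), (5599, 0x5345)]
first_only = build_encode_map(entries, prefer_last=frozenset())
special_cased = build_encode_map(entries)
print(first_only[0x5341], special_cased[0x5341])  # 5287 5512
print(first_only[0x5345], special_cased[0x5345])  # 5289 5599
```

Decoding is unaffected by this choice, because every pointer still maps to its code point; only the encoder's pick among duplicates changes.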