This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019. Please see the home page for more details.
Two examples in <https://gitorious.org/whatwg/big5/trees/master/hkscs-vs-uao/hk/spec>:
http://www.budaedu.org.hk/budaedu/qm-04.html
http://www.budaedu.org.hk/budaedu/shwd-02.html
In both cases, this step kicked in: "If pointer is null, decrease the byte pointer by one." Apparently this doesn't match existing implementations and gives worse results for these examples. I suggest instead emitting the byte as an ASCII character if it is in the range 0x00 to 0x7F, and U+FFFD otherwise. I'm not sure if this change will break other cases. If we can come up with a metric of some kind, I have a huge amount of data to try out various error handling schemes on.
These are the potential error situations we have for a valid lead byte and a trail byte:
1. Valid trail, no corresponding code point
1a. Valid ASCII trail, no corresponding code point
2. Invalid trail
2a. Invalid ASCII trail
Currently the specification decreases the byte pointer for case 2. I think your suggestion is to do it for case 2a. I think some browsers might do 1a as well, not sure.
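The four cases above can be sketched as a small classifier. This is only an illustration: it assumes valid Big5 trail bytes are 0x40 to 0x7E and 0xA1 to 0xFE, and `has_code_point` is a hypothetical flag standing in for a real index lookup. It returns the most specific label, so an ASCII trail is reported as 1a or 2a rather than 1 or 2.

```python
def is_valid_trail(byte: int) -> bool:
    """Assumed valid trail ranges: 0x40-0x7E and 0xA1-0xFE."""
    return 0x40 <= byte <= 0x7E or 0xA1 <= byte <= 0xFE


def classify(trail: int, has_code_point: bool) -> str:
    """Label a (valid lead, trail) pair with the case numbers used above.

    has_code_point is a hypothetical stand-in for the index lookup;
    it is only consulted when the trail byte is valid.
    """
    is_ascii = trail <= 0x7F
    if is_valid_trail(trail):
        if has_code_point:
            return "ok"
        return "1a" if is_ascii else "1"
    return "2a" if is_ascii else "2"


if __name__ == "__main__":
    for trail in (0x41, 0xA1, 0x22, 0x9E):
        print(hex(trail), classify(trail, has_code_point=False))
```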
The problem seems to be that the byte pointer is decreased when it should not be, making the decoder go out of sync. The interesting byte sequences are D1 9E and C6 9F. In both cases, a valid first byte is followed by an invalid second byte, more specifically one in the range 0x7F to 0xA0, whereas valid second bytes are 0x40 to 0x7E and 0xA1 to 0xFE. IE6 (as well as IE7 and IE8, I believe, but not IE9) essentially handles such byte sequences as valid but undefined two-byte sequences and maps them to a single ASCII question mark. This approach may be more compatible with existing content. The only potential ASCII trail byte in this range is 0x7F, which is probably not worth emitting. Philip J: Looking for second bytes in the range 0x7F to 0xA0 in your 'huge amount of data' might be useful.
Okay. So we should change substep 5 of step 5 to also require /byte/ to be less than 0x40 in addition to /pointer/ being null? That would be 2a from comment 1 with an exception for 0x7F.
Philip, ping!
Oops, I was on parental leave in January, I'll look into this during this week!
By the way, the "huge amount of data" is here:
http://html5.org/temp/hk-data.tar.gz (199M) SHA1: 26b5af227bd0c72280aeeba39b22d712fa8d6cae
http://html5.org/temp/tw-data.tar.gz (708M) SHA1: 555c3a9dce5f93d00e9ae47e901091f6140bce52
I can confirm that changing step 5.5 to "If pointer is null and byte is less than 0x40, decrease the byte pointer by one" does fix these two cases. However, without an idea about what kinds of problems the pointer decrease is intended to catch, it's hard for me to guess if it might have unintended side-effects.
We decrease the pointer so that a lead byte cannot mask, for instance, a " (0x22), which could lead to subtle XSS attacks.
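To make the trade-off concrete, here is a toy decoder sketch, not the specification's algorithm. Assumptions: lead bytes are 0x81 to 0xFE, valid trail bytes are 0x40 to 0x7E and 0xA1 to 0xFE, `INDEX` is a hypothetical one-entry stand-in for the real Big5 index, and `should_rewind` is a hypothetical hook that decides whether an invalid trail byte is reprocessed. It shows both failure modes: never rewinding swallows a quote after a lead byte, while always rewinding lets 0x9E become a new lead byte and desynchronizes the decoder.

```python
REPLACEMENT = "\ufffd"

# Hypothetical one-entry table standing in for the real Big5 index.
INDEX = {(0xA4, 0x40): "\u4e00"}


def is_lead(byte):
    return 0x81 <= byte <= 0xFE  # assumed lead byte range


def is_valid_trail(byte):
    return 0x40 <= byte <= 0x7E or 0xA1 <= byte <= 0xFE


def decode(data, should_rewind, index=INDEX):
    """Toy two-byte decoder; should_rewind(byte) is consulted only
    for invalid trail bytes (the "pointer is null" case)."""
    out = []
    lead = 0
    i = 0
    while i < len(data):
        byte = data[i]
        i += 1
        if lead:
            pair = (lead, byte)
            lead = 0
            if is_valid_trail(byte) and pair in index:
                out.append(index[pair])
                continue
            if not is_valid_trail(byte) and should_rewind(byte):
                i -= 1  # reprocess this byte on its own
            out.append(REPLACEMENT)
        elif byte <= 0x7F:
            out.append(chr(byte))
        elif is_lead(byte):
            lead = byte
        else:
            out.append(REPLACEMENT)
    if lead:
        out.append(REPLACEMENT)  # input ended in the middle of a pair
    return "".join(out)


def never(byte):
    return False


def always(byte):
    return True


def below_0x40(byte):
    return byte < 0x40


if __name__ == "__main__":
    for name, rule in (("never", never), ("always", always), ("<0x40", below_0x40)):
        print(name, repr(decode(b'\xd1"x', rule)), repr(decode(b"\xd1\x9e\xa4\x40", rule)))
```

With these stand-ins, only the `below_0x40` rule both preserves the quote and keeps the pair after D1 9E intact.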
Ah, I see. In any event, I have implemented the algorithm in Python and will try to look at cases where the pointer is decreased to confirm properly that doing it just for < 0x40 is the most compatible with existing content.
OK, so here's my analysis of the data: https://gitorious.org/whatwg/big5/source/fd846e26a8625bd11ece23c9de150e722435c0d0:invalid-trail
The vast majority of cases were misencoded junk, along with many where it doesn't really matter in context which error handling is used. These are the trail bytes where it did matter:
rewind: 20 22 26 27 2C 3C 3E
skip: 92 9E 9F
Given the large input there were surprisingly few cases where the error handling mattered, but fortunately the few cases where it does follow a pattern. Only rewinding for <0x40 would work. Another approach would be to only rewind when the trail byte *isn't* a valid lead byte, which is the case where the decoder goes out of sync. The only difference between the two would be what happens to 0x7F and whether or not double U+FFFD will be emitted for what remains in 0x80 and above. Perhaps reverse engineering what browsers do is the safest?
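The comparison between the two candidate rules can be checked mechanically. The sketch below assumes lead bytes are 0x81 to 0xFE and valid trail bytes are 0x40 to 0x7E and 0xA1 to 0xFE; the function names are made up for illustration. It confirms that both rules agree with the categorized trail bytes and lists the invalid trail bytes on which they disagree.

```python
def is_lead(byte):
    return 0x81 <= byte <= 0xFE  # assumed lead byte range


def is_valid_trail(byte):
    return 0x40 <= byte <= 0x7E or 0xA1 <= byte <= 0xFE


# Trail bytes from the analysis where the error handling mattered.
REWIND = [0x20, 0x22, 0x26, 0x27, 0x2C, 0x3C, 0x3E]
SKIP = [0x92, 0x9E, 0x9F]

# The rewind decision only arises for invalid trail bytes.
INVALID_TRAILS = [b for b in range(0x100) if not is_valid_trail(b)]


def rule_below_0x40(byte):
    """Rewind only for invalid trail bytes below 0x40."""
    return byte < 0x40


def rule_not_lead(byte):
    """Rewind only when the invalid trail byte is not a valid lead byte."""
    return not is_lead(byte)


def matches_data(rule):
    """True if the rule rewinds every REWIND byte and no SKIP byte."""
    return all(rule(b) for b in REWIND) and not any(rule(b) for b in SKIP)


disagreements = [
    b for b in INVALID_TRAILS if rule_below_0x40(b) != rule_not_lead(b)
]

if __name__ == "__main__":
    print(matches_data(rule_below_0x40), matches_data(rule_not_lead))
    print([hex(b) for b in disagreements])
```

Under these assumptions the rules differ only on 0x7F, 0x80 and 0xFF, which matches the description above of 0x7F plus what remains in 0x80 and above.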
I created a test page to determine what browsers do: https://gitorious.org/whatwg/big5/raw/20ca0f32e7fc429fce2809d3b88f3757ac0256ed:invalid-trail.html I've tested Chromium 28.0.1500.71, Firefox 23.0 and Opera 12.16 (Presto). All three will emit the 0x7F. For the trail bytes above that, it looks like the only difference is whether 1 or 2 U+FFFD are emitted. After looking at this, my recommendation would be to rewind if the invalid trail is < 0x80, which looks like it might be what Gecko does since it only emits a single U+FFFD for >=0x80 invalid trails.
I added the <0x80 check to the Python implementation and verified that it gives the desired output for the categorized invalid trail bytes. I think I'm done now, go forth and spec it!
https://github.com/whatwg/encoding/commit/88a2177754655255df378e1b97cd085420399fe4
LGTM!