16771 – big5 error handling

This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019.

Status:	RESOLVED FIXED

Alias:	None

Product:	WHATWG
Classification:	Unclassified
Component:	Encoding
Version:	unspecified
Hardware:	PC Windows 3.1

Importance:	P2 normal
Target Milestone:	Unsorted
Assignee:	Anne
QA Contact:	sideshowbarker+encodingspec

Reported:	2012-04-18 09:17 UTC by Philip Jägenstedt
Modified:	2013-12-12 20:33 UTC
CC List:	2 users

Description Philip Jägenstedt 2012-04-18 09:17:47 UTC

Two examples in <https://gitorious.org/whatwg/big5/trees/master/hkscs-vs-uao/hk/spec>:

http://www.budaedu.org.hk/budaedu/qm-04.html
http://www.budaedu.org.hk/budaedu/shwd-02.html

In both cases, this step kicked in: "If pointer is null, decrease the byte pointer by one."

Apparently this doesn't match existing implementations and gives worse results for these examples. I suggest instead emitting an ASCII character if the byte is in the range 0x00 to 0x7F, and U+FFFD otherwise.

I'm not sure if this change will break other cases. If we can come up with a metric of some kind, I have a huge amount of data to try out various error handling schemes on.

Comment 1 Anne 2012-04-18 10:03:37 UTC

These are the potential error situations we have for a valid lead byte and a trail byte:

1. Valid trail, no corresponding code point
1a. Valid ASCII trail, no corresponding code point
2. Invalid trail
2a. Invalid ASCII trail

Currently the specification decreases the byte pointer for case 2. I think your suggestion is to do it for case 2a. I think some browsers might do 1a as well, not sure.

Comment 2 pub-w3 2012-04-25 15:41:10 UTC

The problem seems to be that the byte pointer is decreased when it should not be, making the decoder go out of sync.

The interesting byte sequences are D1 9E and C6 9F.

In both cases, a valid first byte is followed by an invalid second byte, more specifically one in the range 0x7F to 0xA0, whereas valid second bytes are 0x40 to 0x7E and 0xA1 to 0xFE.
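A minimal Python sketch of this classification, using the trail-byte ranges given above and the case labels from comment 1. The function names are hypothetical, and `has_code_point` stands in for an index lookup that is not reproduced here:

```python
def is_valid_trail(byte: int) -> bool:
    """Trail-byte ranges as stated in this comment: 0x40-0x7E and 0xA1-0xFE."""
    return 0x40 <= byte <= 0x7E or 0xA1 <= byte <= 0xFE


def classify(trail: int, has_code_point: bool = False) -> str:
    """Return "ok" or the comment 1 case label ("1", "1a", "2", "2a").

    has_code_point is a stand-in for the real index lookup (an assumption).
    """
    is_ascii = trail <= 0x7F
    if is_valid_trail(trail):
        if has_code_point:
            return "ok"
        return "1a" if is_ascii else "1"
    return "2a" if is_ascii else "2"


# The two sequences named above both end in an invalid, non-ASCII trail byte.
for lead, trail in [(0xD1, 0x9E), (0xC6, 0x9F)]:
    print(f"{lead:02X} {trail:02X}: case {classify(trail)}")
```

Both sequences come out as case 2, the case for which the byte pointer was being decreased at the time.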

IE6 (as well as IE7 and IE8, I believe, but not IE9) essentially handles such byte sequences as valid but undefined two-byte sequences and maps them to a single ASCII question mark.  This approach may be more compatible with existing content.

The only potential ASCII trail byte in this range is 7F, which is probably not worth emitting.

Philip J:  Looking for second bytes in the range 7F--A0 in your 'huge amount of data' might be useful.

Comment 3 Anne 2013-01-21 15:19:54 UTC

Okay. So we should change substep 5 of step 5 to also require /byte/ to be less than 0x40 in addition to /pointer/ being null?

That would be 2a from comment 1 with an exception for 0x7F.

Comment 4 Anne 2013-09-04 16:12:36 UTC

Philip, ping!

Comment 5 Philip Jägenstedt 2013-09-05 07:19:54 UTC

Oops, I was on parental leave in January. I'll look into this this week!

Comment 6 Philip Jägenstedt 2013-09-05 07:52:30 UTC

By the way, the "huge amount of data" is here:

http://html5.org/temp/hk-data.tar.gz (199M)
SHA1: 26b5af227bd0c72280aeeba39b22d712fa8d6cae

http://html5.org/temp/tw-data.tar.gz (708M)
SHA1: 555c3a9dce5f93d00e9ae47e901091f6140bce52

Comment 7 Philip Jägenstedt 2013-09-05 20:06:46 UTC

I can confirm that changing step 5.5 to "If pointer is null and byte is less than 0x40, decrease the byte pointer by one" does fix these two cases.

However, without an idea about what kinds of problems the pointer decrease is intended to catch, it's hard for me to guess if it might have unintended side-effects.

Comment 8 Anne 2013-09-06 11:25:09 UTC

We decrease the pointer so that a lead byte cannot mask, for instance, " (0x22), which could lead to subtle XSS attacks.
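A toy Python illustration of the masking problem. It is not the spec algorithm: the lead-byte range 0x81 to 0xFE is an assumption taken from the current Encoding Standard, `toy_decode` is a hypothetical name, and valid two-byte pairs are replaced by a placeholder character instead of an index lookup.

```python
REPLACEMENT = "\uFFFD"


def is_lead(byte: int) -> bool:
    # Assumed lead-byte range (not stated in this bug).
    return 0x81 <= byte <= 0xFE


def is_valid_trail(byte: int) -> bool:
    return 0x40 <= byte <= 0x7E or 0xA1 <= byte <= 0xFE


def toy_decode(data: bytes, rewind: bool) -> str:
    """Decode data; on an invalid trail byte, optionally put it back."""
    out = []
    i = 0
    while i < len(data):
        byte = data[i]
        i += 1
        if byte <= 0x7F:
            out.append(chr(byte))
        elif is_lead(byte) and i < len(data):
            trail = data[i]
            i += 1
            if is_valid_trail(trail):
                out.append("\u25A1")  # placeholder for the index lookup
            else:
                out.append(REPLACEMENT)
                if rewind:
                    i -= 1  # decrease the byte pointer by one
        else:
            out.append(REPLACEMENT)
    return "".join(out)


markup = b'<a title="\xD1">x</a>'
print(toy_decode(markup, rewind=False))  # closing quote is swallowed
print(toy_decode(markup, rewind=True))   # closing quote survives
```

Without the rewind, the stray lead byte consumes the closing quote and the attribute value runs on into the following markup.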

Comment 9 Philip Jägenstedt 2013-09-06 13:40:57 UTC

Ah, I see. In any event, I have implemented the algorithm in Python and will try to look at cases where the pointer is decreased to confirm properly that doing it just for < 0x40 is the most compatible with existing content.

Comment 10 Philip Jägenstedt 2013-09-10 21:14:24 UTC

OK, so here's my analysis of the data:

https://gitorious.org/whatwg/big5/source/fd846e26a8625bd11ece23c9de150e722435c0d0:invalid-trail

The vast majority of cases were misencoded junk, and in many others it doesn't really matter in context which error handling is used.

These are the trail bytes where it did matter:

rewind: 20 22 26 27 2C 3C 3E

skip: 92 9E 9F

Given the large input there were surprisingly few cases where the error handling mattered, but fortunately the few cases where it does follow a pattern.

Only rewinding for <0x40 would work.

Another approach would be to only rewind when the trail byte *isn't* a valid lead byte, which is the case where the decoder goes out of sync.

The only difference between the two would be what happens to 0x7F and whether or not double U+FFFD will be emitted for what remains in 0x80 and above. Perhaps reverse engineering what browsers do is the safest?
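The difference between the two approaches can be enumerated with a short Python sketch. The trail-byte ranges come from comment 2; the lead-byte range 0x81 to 0xFE is an assumption taken from the current Encoding Standard, and the function names are hypothetical.

```python
def is_valid_trail(byte: int) -> bool:
    return 0x40 <= byte <= 0x7E or 0xA1 <= byte <= 0xFE


def is_lead(byte: int) -> bool:
    # Assumed lead-byte range (not stated in this bug).
    return 0x81 <= byte <= 0xFE


def rewind_below_0x40(trail: int) -> bool:
    """First approach: only rewind for trail bytes < 0x40."""
    return trail < 0x40


def rewind_unless_lead(trail: int) -> bool:
    """Second approach: only rewind when the trail is not a valid lead byte."""
    return not is_lead(trail)


invalid_trails = [b for b in range(0x100) if not is_valid_trail(b)]
differing = [
    b for b in invalid_trails
    if rewind_below_0x40(b) != rewind_unless_lead(b)
]
print([f"{b:02X}" for b in differing])  # ['7F', '80', 'FF']
```

Under these assumptions the two approaches disagree only on 0x7F, 0x80 and 0xFF. Both rewind for every byte in the "rewind" list and skip every byte in the "skip" list above.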

Comment 11 Philip Jägenstedt 2013-09-11 15:51:50 UTC

I created a test page to determine what browsers do:

https://gitorious.org/whatwg/big5/raw/20ca0f32e7fc429fce2809d3b88f3757ac0256ed:invalid-trail.html

I've tested Chromium 28.0.1500.71, Firefox 23.0 and Opera 12.16 (Presto).

All three will emit the 0x7F. For the trail bytes above that, it looks like the only difference is whether 1 or 2 U+FFFD are emitted.

After looking at this, my recommendation would be to rewind if the invalid trail is < 0x80, which looks like it might be what Gecko does since it only emits a single U+FFFD for >=0x80 invalid trails.
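A hedged Python sketch of the recommended rule, checked against the trail bytes categorized in comment 10. The function name is hypothetical, and it only models the output for a valid lead byte followed by an invalid trail byte:

```python
REPLACEMENT = "\uFFFD"

# Trail bytes from comment 10, grouped by the handling that suited the content.
WANT_REWIND = [0x20, 0x22, 0x26, 0x27, 0x2C, 0x3C, 0x3E]
WANT_SKIP = [0x92, 0x9E, 0x9F]


def output_for_invalid_trail(trail: int) -> str:
    """Output for a valid lead byte plus an invalid trail byte."""
    if trail < 0x80:
        # Rewind: U+FFFD for the lead byte, then the trail decodes as ASCII.
        return REPLACEMENT + chr(trail)
    # Skip: both bytes are consumed and a single U+FFFD is emitted.
    return REPLACEMENT


assert all(output_for_invalid_trail(b) == REPLACEMENT + chr(b) for b in WANT_REWIND)
assert all(output_for_invalid_trail(b) == REPLACEMENT for b in WANT_SKIP)

# 0x7F is emitted after the U+FFFD, as in the three tested browsers.
print(repr(output_for_invalid_trail(0x7F)))
```

With this rule a rewound byte is always ASCII, so an invalid trail of 0x80 or above never produces a second U+FFFD.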

Comment 12 Philip Jägenstedt 2013-09-11 16:01:24 UTC

I added the <0x80 check to the Python implementation and verified that it gives the desired output for the categorized invalid trail bytes.

I think I'm done now, go forth and spec it!

Comment 13 Anne 2013-12-12 13:44:56 UTC

https://github.com/whatwg/encoding/commit/88a2177754655255df378e1b97cd085420399fe4

Comment 14 Philip Jägenstedt 2013-12-12 20:33:00 UTC

LGTM!