28141 – treatment of invalid 2-byte sequence is different in different CJK encodings

This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019. Please see the home page for more details.

Bug 28141 - treatment of invalid 2-byte sequence is different in different CJK encodings

Summary: treatment of invalid 2-byte sequence is different in different CJK encodings

Status:	RESOLVED FIXED

Alias:	None

Product:	WHATWG
Classification:	Unclassified
Component:	Encoding (show other bugs)
Version:	unspecified
Hardware:	PC Linux

Importance:	P2 normal
Target Milestone:	Unsorted
Assignee:	Anne
QA Contact:	sideshowbarker+encodingspec

URL:
Whiteboard:
Keywords:

Depends on:
Blocks:

Reported:	2015-03-04 23:33 UTC by Jungshik Shin
Modified:	2015-08-19 12:51 UTC (History)
CC List:	4 users (show)

See Also:

Attachments

Description Jungshik Shin 2015-03-04 23:33:24 UTC

Per bug 16691 comment 15, I'm tightening Blink's encoding tables for CJK encodings to handle unmappable 2-byte sequence in a safe manner. 



The current spec has the following provision after looking up |pointer|. 

* EUC-KR decoder
   If pointer is null and byte is in the range 0x00 to 0x7F, prepend byte to stream.


* Big5 decoder
  
   If pointer is null and byte is in the range 0x00 to 0x7F, prepend byte to stream.

* Shift_JIS decoder
   If pointer is null, prepend byte to stream.

* EUC-JP decoder
   If byte is not in the range 0xA1 to 0xFE, prepend byte to stream.


* GB18030 decoder
   If pointer is null, prepend byte to stream.

For now, let's put aside EUC-JP and GB18030. 

I don't see a reason to make SJIS decoder behave differently than EUC-KR and Big5 decoder. One possible reason may be that [xA1, xDF] is a character by itself. 

In SJIS, "0xFC 0xE0" [1] is turned to U+FFFD, but the second byte (0xE0) becomes the lead of what follows.

In EUC-KR, "0xFE 0xE0" is turned to U+FFFD and the next lead byte is taken from the byte *after* 0xE0. 

Shouldn't we change the part of SJIS decoder quoted above to the following? 

  If pointer is null and byte is in the range of 0x00 - 0x7F, prepend byte to the stream.

Comment 1 Jungshik Shin 2015-03-05 00:06:26 UTC

The current EUC-JP spec makes sense so that there's no need to change it. 

I haven't taken a look at GB18030, yet. 

Anyway, so far SJIS is the only one that we have to consider changing.

Comment 2 Jungshik Shin 2015-03-06 19:18:51 UTC

Another piece of information: 

I was tightening Chromium's Big5's table and found that it has a lot of "holes" in the trail byte in the ASCII range. Below is what I found (all in hexadecimal). 

lead: trail byte holes in the ASCII range 
87: 76
89: 42 44 45 4A 4B
8A: 42 63 75
8B: 54
8D: 41
9B: 61
9F: 4E
A0: 54 57 5A 62 72

They're all in [a-zA-Z]. So, arguably, the XSS risk is lower than 'punctuation-mark-like characters' in the ASCII range. 

In case of EUC-KR (windows-949), the trail byte in the ASCII range is limited to [a-zA-Z]. So, without 'adding back to the stream' clause, we'd only eat up [a-zA-Z]. 


Unless we're sure that [a-zA-Z] is harmless when eaten up, we should keep 'adding back to the stream if the trail is [0, 7F]" clause (in case of ICU, perhaps the overall memory/perf impact of keeping the current spec is neutral to a small net-loss; haven't compared yet). 

Anyway, it occurred to me that we might think about this, too.

Comment 3 Philip Jägenstedt 2015-03-13 03:32:15 UTC

What do existing implementations do for SJIS?

Comment 4 Jungshik Shin 2015-03-18 21:13:31 UTC

ICU treats an 'illegal' byte sequence differently from a byte sequence 'unassigned' to a Unicode character. 

For instance, in EUC-KR (windows-949), <FE A1> is a valid byte sequence, but is not assigned any character. So, the sequence as a whole is turned to U+FFFD. 

Without tightening the vaild trail byte range for EUC-KR [1], <FE 41> is a valid byte sequence  and is converted to U+FFFD (exactly the same treatment as <FE A1>). 

OTOH, <FE 22> has an illegal trail byte (because 0x22 is outside the trail byte range for EUC-KR/Windows-949) and is turned to <U+FFFD, U+0022>  


The same is true of Shift_JIS. Because [80-FC] is the valid trail byte range, <EB 9F> is turned to U+FFFD (there's no mapped character at this position) instead of <U+FFFD> being emitted and '0x9F' being added back to the stream 



[1] Blink is just tightening up the valid trail byte range so that 'x41' will not be valid any more if lead is C8 or higher.

Comment 5 Philip Jägenstedt 2015-03-19 14:07:08 UTC

Hmm, OK. If there's a spec change you want to (or have already) implement that's likely to be Web compatible and closer to what ICU already does, that probably won't be controversial. Concretely, is it only the SJIS bit that should be changed in the spec?

(Anne has the final say of course, I'm just trying to move things along.)

Comment 6 Anne 2015-08-19 12:51:12 UTC

https://github.com/whatwg/encoding/issues/5 changed big5 to check the code point rather than the pointer.

shift_jis had that problem too, but indeed, we should eat the trail byte for shift_jis if it is not an ASCII byte.

euc-kr seems wrong too based on that.

gb18030 too.

So I fixed shift_jis, euc-kr, and gb18030.

https://github.com/whatwg/encoding/commit/640bf69847a17fd98df027fd6cd5ae384ac82dab