Bug 28141: treatment of invalid 2-byte sequence is different in different CJK encodings
Created: 2015-03-04 23:33:24 +0000
Last modified: 2015-08-19 12:51:12 +0000
Classification: Unclassified
Product: WHATWG
Component: Encoding
Version: unspecified
Platform: PC, Linux
Status: RESOLVED FIXED
Priority: P2
Severity: normal
Target milestone: Unsorted
Reporter: jshin
Assignee: annevk
CC: jsbell, mike, philipj, www-international
QA contact: sideshowbarker+encodingspec

Comment 0 (118330), jshin, 2015-03-04 23:33:24 +0000

Per bug 16691 comment 15, I'm tightening Blink's encoding tables for CJK encodings to handle unmappable 2-byte sequence in a safe manner.

The current spec has the following provision after looking up |pointer|.

* EUC-KR decoder
  If pointer is null and byte is in the range 0x00 to 0x7F, prepend byte to stream.
* Big5 decoder
  If pointer is null and byte is in the range 0x00 to 0x7F, prepend byte to stream.
* Shift_JIS decoder
  If pointer is null, prepend byte to stream.
* EUC-JP decoder
  If byte is not in the range 0xA1 to 0xFE, prepend byte to stream.
* GB18030 decoder
  If pointer is null, prepend byte to stream.

For now, let's put aside EUC-JP and GB18030. I don't see a reason to make SJIS decoder behave differently than EUC-KR and Big5 decoder. One possible reason may be that [xA1, xDF] is a character by itself.

In SJIS, "0xFC 0xE0" [1] is turned to U+FFFD, but the second byte (0xE0) becomes the lead of what follows.

In EUC-KR, "0xFE 0xE0" is turned to U+FFFD and the next lead byte is taken from the byte *after* 0xE0.

Shouldn't we change the part of SJIS decoder quoted above to the following?

  If pointer is null and byte is in the range of 0x00 - 0x7F, prepend byte to the stream.

Comment 1 (118332), jshin, 2015-03-05 00:06:26 +0000

The current EUC-JP spec makes sense so that there's no need to change it. I haven't taken a look at GB18030, yet. Anyway, so far SJIS is the only one that we have to consider changing.

Comment 2 (118403), jshin, 2015-03-06 19:18:51 +0000

Another piece of information: I was tightening Chromium's Big5's table and found that it has a lot of "holes" in the trail byte in the ASCII range.
Below is what I found (all in hexadecimal).

lead: trail byte holes in the ASCII range
87: 76
89: 42 44 45 4A 4B
8A: 42 63 75
8B: 54
8D: 41
9B: 61
9F: 4E
A0: 54 57 5A 62 72

They're all in [a-zA-Z]. So, arguably, the XSS risk is lower than 'punctuation-mark-like characters' in the ASCII range.

In case of EUC-KR (windows-949), the trail byte in the ASCII range is limited to [a-zA-Z]. So, without the 'adding back to the stream' clause, we'd only eat up [a-zA-Z].

Unless we're sure that [a-zA-Z] is harmless when eaten up, we should keep the 'adding back to the stream if the trail is [0, 7F]' clause (in case of ICU, perhaps the overall memory/perf impact of keeping the current spec is neutral to a small net-loss; haven't compared yet).

Anyway, it occurred to me that we might think about this, too.

Comment 3 (118576), philipj, 2015-03-13 03:32:15 +0000

What do existing implementations do for SJIS?

Comment 4 (118668), jshin, 2015-03-18 21:13:31 +0000

ICU treats an 'illegal' byte sequence differently from a byte sequence 'unassigned' to a Unicode character.

For instance, in EUC-KR (windows-949), <FE A1> is a valid byte sequence, but is not assigned any character. So, the sequence as a whole is turned to U+FFFD. Without tightening the valid trail byte range for EUC-KR [1], <FE 41> is a valid byte sequence and is converted to U+FFFD (exactly the same treatment as <FE A1>).

OTOH, <FE 22> has an illegal trail byte (because 0x22 is outside the trail byte range for EUC-KR/Windows-949) and is turned to <U+FFFD, U+0022>.

The same is true of Shift_JIS. Because [80-FC] is the valid trail byte range, <EB 9F> is turned to U+FFFD (there's no mapped character at this position) instead of <U+FFFD> being emitted and '0x9F' being added back to the stream.

[1] Blink is just tightening up the valid trail byte range so that 'x41' will not be valid any more if lead is C8 or higher.

Comment 5 (118683), philipj, 2015-03-19 14:07:08 +0000

Hmm, OK.
If there's a spec change you want to (or have already) implement that's likely to be Web compatible and closer to what ICU already does, that probably won't be controversial. Concretely, is it only the SJIS bit that should be changed in the spec?

(Anne has the final say of course, I'm just trying to move things along.)

Comment 6 (122661), annevk, 2015-08-19 12:51:12 +0000

https://github.com/whatwg/encoding/issues/5 changed big5 to check the code point rather than the pointer. shift_jis had that problem too, but indeed, we should eat the trail byte for shift_jis if it is not an ASCII byte. euc-kr seems wrong too based on that. gb18030 too.

So I fixed shift_jis, euc-kr, and gb18030.

https://github.com/whatwg/encoding/commit/640bf69847a17fd98df027fd6cd5ae384ac82dab
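
For readers following the thread, here is a minimal Python sketch of the two policies discussed above: "prepend the trail byte whenever the lookup fails" versus "prepend it only when it is an ASCII byte". It is not the Encoding Standard's algorithm and not Blink or ICU code. The function `decode_two_byte`, the lead range 0x81 to 0xFE, and the one-entry `TOY_INDEX` table are all hypothetical and chosen only to make the difference visible; the real Shift_JIS lead ranges and index are not reproduced here.

```python
REPLACEMENT = "\ufffd"

# Hypothetical one-entry index, NOT real Shift_JIS data: it pretends that
# the pair <E0 41> maps to "X" so the effect of re-reading 0xE0 is visible.
TOY_INDEX = {(0xE0, 0x41): "X"}


def is_ascii(byte: int) -> bool:
    return 0x00 <= byte <= 0x7F


def is_lead(byte: int) -> bool:
    # Simplified assumption: every byte in 0x81..0xFE starts a 2-byte sequence.
    return 0x81 <= byte <= 0xFE


def decode_two_byte(data: bytes, prepend_only_ascii: bool) -> str:
    """Toy 2-byte decoder showing the 'prepend byte to stream' step."""
    out = []
    i = 0
    while i < len(data):
        byte = data[i]
        i += 1
        if is_ascii(byte):
            out.append(chr(byte))
            continue
        if not is_lead(byte) or i == len(data):
            # Stray non-lead byte, or a lead byte at end of input.
            out.append(REPLACEMENT)
            continue
        trail = data[i]
        i += 1
        char = TOY_INDEX.get((byte, trail))  # None plays the role of a null pointer
        if char is not None:
            out.append(char)
            continue
        out.append(REPLACEMENT)
        if is_ascii(trail) or not prepend_only_ascii:
            i -= 1  # "prepend byte to stream": the trail byte is read again
    return "".join(out)


# Old Shift_JIS wording: 0xE0 is put back and becomes the lead of what follows.
old = decode_two_byte(bytes([0xFC, 0xE0, 0x41]), prepend_only_ascii=False)
# ASCII-only wording: 0xE0 is consumed together with the failed lead byte.
new = decode_two_byte(bytes([0xFC, 0xE0, 0x41]), prepend_only_ascii=True)
print(repr(old), repr(new))
```

With this toy table, the first call swallows the following "A" into a two-byte character, while the second leaves it as plain ASCII. An ASCII trail byte such as 0x22 is put back under both policies, which matches the <U+FFFD, U+0022> result described for <FE 22> in comment 4.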