21057 2013-02-20 14:19:27 +0000 Introduce additional labels for the replacement encoding 2015-08-21 07:39:13 +0000 1 1 1 Unclassified WHATWG Encoding unspecified PC Linux RESOLVED MOVED blocked on implementer research P2 normal Unsorted 1 hsivonen annevk jsbell jshin mike VYV03354 www-international zcorpan sideshowbarker+encodingspec oldest_to_newest

83390 0 hsivonen 2013-02-20 14:19:27 +0000
Problem statement:

1) The Encoding Standard removes the ISO-2022-CN encoding. This will make sites that rely on that encoding being supported vulnerable to XSS the way Yahoo search was vulnerable in Chrome when Chrome removed ISO-2022-KR. See https://code.google.com/p/chromium/issues/detail?id=15701

2) There exist ASCII-incompatible encodings in the world outside the Encoding Standard, and support for those encodings might be exposed by server-side libraries. Sites that are naïve enough to let the user specify the output encoding, and that then pass the user-supplied encoding name to a server-side library without whitelisting ASCII-compatible encodings, are vulnerable to EBCDIC attacks: an attacker can request that the site use an EBCDIC-based encoding, and the site responds with EBCDIC, which isn't recognized by non-IE browsers. Those browsers fall back on an ASCII-compatible encoding, so the EBCDIC bytes are interpreted in a dangerous way. See http://zaynar.co.uk/docs/charset-encoding-xss.html for a reference to an actual search engine that was vulnerable to this attack.

Proposed solution:
Define a replacement encoding that decodes all possible byte values to the REPLACEMENT CHARACTER. Make the known labels for ASCII-incompatible encodings that exist but aren't part of the Encoding Standard aliases for the replacement encoding.

Additional info:
This solution would pave the way for safe removal of ISO-2022-KR and hz-gb-2312 from the set of encodings supported by the Encoding Standard.
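For illustration only, here is a minimal Python sketch of the replacement encoding proposed in comment 0. The names `REPLACEMENT_LABELS`, `is_replacement_label`, and `decode_replacement` are hypothetical, the label set is a small sample rather than an agreed list, and the `single` flag models the alternative raised in comment 3 (one U+FFFD for the whole input). Returning an empty string for empty input is an assumption of this sketch.

```python
REPLACEMENT_CHARACTER = "\ufffd"

# Illustrative subset only: labels named in this thread, not a final list.
REPLACEMENT_LABELS = frozenset({"iso-2022-cn", "iso-2022-cn-ext"})


def is_replacement_label(label: str) -> bool:
    """Return True if the label (trimmed, lowercased) maps to replacement."""
    return label.strip(" \t\n\f\r").lower() in REPLACEMENT_LABELS


def decode_replacement(data: bytes, single: bool = False) -> str:
    """Decode without ever exposing the input bytes as markup.

    single=False: every byte becomes U+FFFD (comment 0).
    single=True:  the whole input becomes one U+FFFD (comment 3).
    Assumption of this sketch: empty input decodes to the empty string.
    """
    if not data:
        return ""
    if single:
        return REPLACEMENT_CHARACTER
    return REPLACEMENT_CHARACTER * len(data)
```

Either variant guarantees that no byte sequence, however crafted, reaches the HTML tokenizer as `<` or any other ASCII character.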
83391 1 annevk 2013-02-20 14:42:51 +0000
We should be conservative with this list I suppose as sites might rely on a fallback encoding being in play. We should probably include these:

* iso-2022-cn
* iso-2022-cn-ext

Less sure about:

* EBCDIC labels
* utf-7
* utf-32

83392 2 hsivonen 2013-02-20 15:01:01 +0000
Hmm. Might want to allow 0x20 to decode as U+0020 to avoid accidentally DoSing layout.

(In reply to comment #1)
> Less sure about:
>
> * EBCDIC labels

To the extent IE currently recognizes these, at least in theory:
* Relying on falling back to ASCII doesn't work today in IE.
* IE would become less XSS-resilient if it dropped knowledge of those labels without aliasing them to a replacement encoding.

> * utf-7
> * utf-32

These might plausibly be relying on fallback currently.

Others to consider:
* CESU-8
* BOCU-1
* SCSU

83393 3 annevk 2013-02-20 15:12:13 +0000
We could emit a single U+FFFD and terminate I think. Pretend as if all bytes were consumed.

83421 4 VYV03354 2013-02-20 19:20:09 +0000
Chrome implements a "fake" ISO-2022-CN decoder which always emits U+FFFD for all double-byte characters.

83422 5 VYV03354 2013-02-20 19:26:08 +0000
But I don't think the ISO-2022-CN problem is really exploitable in the real world. Gecko's ISO-2022-CN decoder has had an exploitable bug for a long time. I even described it in a public bug, but nobody cared.
https://bugzilla.mozilla.org/show_bug.cgi?id=470523
So Gecko has completely ignored the ISO-2022-CN label since Firefox 19.

83543 6 annevk 2013-02-22 10:55:57 +0000
We also have to decide what to do for TextDecoder. And the encoder story for <form accept-charset>, script injecting a link into an iso-2022-kr <iframe>, and maybe more.

I think the encoder story can be utf-8. Supporting it in TextDecoder does not seem problematic. TextEncoder is already prohibited.

83582 7 VYV03354 2013-02-22 19:11:52 +0000
I think we should also remove them from TextDecoder for consistency.
If people really need to decode those encodings, they can implement the decoder using gbk/euc-kr decoders.

83583 8 VYV03354 2013-02-22 19:14:25 +0000
(In reply to comment #2)
> Others to consider:
> * CESU-8
> * BOCU-1
> * SCSU

I don't think we need to consider encodings that no browser has ever supported. If by any chance some pages relied on those encodings, they are already vulnerable.

83585 9 annevk 2013-02-22 19:34:42 +0000
What is the rationale for that? They might already be vulnerable, but would it not be better if they were less vulnerable going forward?

83586 10 VYV03354 2013-02-22 19:37:54 +0000
If the vulnerable page is actually present in the real world at all.

83587 11 annevk 2013-02-22 19:41:52 +0000
As a trial balloon: https://github.com/whatwg/encoding/commit/8329a2e768caea6908d600debd3cc8a6dc59c3c3

(I.e. not final, but gives us a thing to discuss.)

83614 12 annevk 2013-02-23 08:03:18 +0000
So going forward everything under EBCDIC in http://wiki.whatwg.org/wiki/Web_Encodings#Encodings_3 should be added. Let me know if you disagree. Then once implementations remove iso-2022-kr I will add that one too.

83646 13 hsivonen 2013-02-25 11:43:53 +0000
(In reply to comment #8)
> (In reply to comment #2)
> > Others to consider:
> > * CESU-8
> > * BOCU-1
> > * SCSU
>
> I don't think we need to consider encodings that no browser has ever
> supported. If by any chance some pages relied on those encodings, they are
> already vulnerable.

The threat scenario is that the server accepts an encoding name from a query string and passes it to a server-side library that implements encodings that browsers don't support.

96961 14 annevk 2013-12-02 14:17:38 +0000
Per http://mxr.mozilla.org/mozilla-central/source/dom/encoding/labelsencodings.properties Gecko seems to match the specification. Do we want to add any of the other ones or should I resolve this as FIXED?
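The threat scenario from comment 13 can be sketched in Python, whose standard library happens to ship an EBCDIC codec under the name `cp037`. This is an illustration under stated assumptions, not a description of any real site: `render_naive`, `render_checked`, and `ALLOWED_OUTPUT_ENCODINGS` are hypothetical names, and falling back to UTF-8 for unlisted names is one possible policy among several.

```python
import html

# Assumption: the only output encodings this hypothetical site will emit.
ALLOWED_OUTPUT_ENCODINGS = frozenset({"utf-8", "windows-1252"})


def render_naive(user_text: str, requested_encoding: str) -> bytes:
    """Vulnerable pattern: escape, then encode with whatever was requested."""
    page = "<p>" + html.escape(user_text) + "</p>"
    return page.encode(requested_encoding)


def render_checked(user_text: str, requested_encoding: str) -> bytes:
    """Safer pattern: ignore any encoding name that is not allowlisted."""
    name = requested_encoding.strip().lower()
    if name not in ALLOWED_OUTPUT_ENCODINGS:
        name = "utf-8"  # assumption: fall back rather than reject
    page = "<p>" + html.escape(user_text) + "</p>"
    # Characters the target encoding lacks become numeric references.
    return page.encode(name, errors="xmlcharrefreplace")


# The attacker picks characters whose cp037 bytes spell an ASCII tag.
# None of them is '<', '>', '&' or a quote, so escaping leaves them alone.
payload = b"<script>".decode("cp037")

naive = render_naive(payload, "cp037")
checked = render_checked(payload, "cp037")
```

A browser that does not know `cp037` and falls back to an ASCII-compatible encoding sees the literal bytes `<script>` in `naive`, while `checked` contains no such sequence. Aliasing labels like this one to the replacement encoding would protect the naive site on the browser side as well.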
97109 15 hsivonen 2013-12-04 10:15:55 +0000
(In reply to Anne from comment #14)
> Do we want to add any of
> the other ones or should I resolve this as FIXED?

I think we should add
* BOCU-1
* SCSU
* Known EBCDIC labels.

...as labels of the replacement encoding in order to mitigate the attack described in http://zaynar.co.uk/docs/charset-encoding-xss.html .

If Google Translate works for http://masatokinugawa.l0.cm/2013/06/accounts.google.com-utf-32-xss.html , it appears that Google, who really should know better, allowed the output encoding to be controlled by the request URL.

UTF-7 is not on the list, because it's not dangerous to interpret UTF-7 as ASCII and there's some value in seeing the ASCII decoding of UTF-7 for Latin-script text.

UTF-32 is not on the list, because the BOM taking precedence and the little-endian UTF-32 sniffing as UTF-16LE would make aliasing to replacement a mere placebo. Furthermore, interpreting UTF-32 as non-UTF-32 doesn't appear to be dangerous when U+0000 is not discarded before tokenization, which it isn't in HTML.
97131 16 annevk 2013-12-04 16:05:36 +0000 These are the known EBCDIC ones that IE supports per the WHATWG table including BOCU-1 and SCSU labels: * bocu-1 * ccsid00924 * ccsid01140 * ccsid01141 * ccsid01142 * ccsid01143 * ccsid01144 * ccsid01145 * ccsid01146 * ccsid01147 * ccsid01148 * ccsid01149 * cp00924 * cp01140 * cp01141 * cp01142 * cp01143 * cp01144 * cp01145 * cp01146 * cp01147 * cp01148 * cp01149 * cp037 * cp1025 * cp1026 * cp273 * cp278 * cp280 * cp284 * cp285 * cp290 * cp297 * cp420 * cp423 * cp424 * cp500 * cp870 * cp871 * cp875 * cp880 * cp905 * cp930 * cp933 * cp935 * cp937 * cp939 * csbocu-1 * csbocu1 * csibm037 * csibm1026 * csibm273 * csibm277 * csibm278 * csibm280 * csibm284 * csibm285 * csibm290 * csibm297 * csibm420 * csibm423 * csibm424 * csibm500 * csibm870 * csibm871 * csibm880 * csibm905 * csibmthai * csscsu * ebcdic-cp-ar1 * ebcdic-cp-be * ebcdic-cp-ca * ebcdic-cp-ch * ebcdic-cp-dk * ebcdic-cp-es * ebcdic-cp-fi * ebcdic-cp-fr * ebcdic-cp-gb * ebcdic-cp-gr * ebcdic-cp-he * ebcdic-cp-is * ebcdic-cp-it * ebcdic-cp-nl * ebcdic-cp-no * ebcdic-cp-roece * ebcdic-cp-se * ebcdic-cp-tr * ebcdic-cp-us * ebcdic-cp-wt * ebcdic-cp-yu * ebcdic-cyrillic * ebcdic-de-273+euro * ebcdic-dk-277+euro * ebcdic-es-284+euro * ebcdic-fi-278+euro * ebcdic-fr-297+euro * ebcdic-gb-285+euro * ebcdic-international-500+euro * ebcdic-is-871+euro * ebcdic-it-280+euro * ebcdic-jp-kana * ebcdic-latin9--euro * ebcdic-no-277+euro * ebcdic-se-278+euro * ebcdic-us-37+euro * ibm-thai * ibm00924 * ibm01047 * ibm01140 * ibm01141 * ibm01142 * ibm01143 * ibm01144 * ibm01145 * ibm01146 * ibm01147 * ibm01148 * ibm01149 * ibm037 * ibm1026 * ibm273 * ibm277 * ibm278 * ibm280 * ibm284 * ibm285 * ibm290 * ibm297 * ibm420 * ibm423 * ibm424 * ibm500 * ibm870 * ibm871 * ibm880 * ibm905 * scsu * x-cp21027 * x-ebcdic-japaneseanduscanada * x-ebcdic-koreanextended Of course on the server ICU might be used and which labels we want to ban from that is unclear to me. 
ICU supports a lot of labels, including weird ones like "ISO_2022,locale=ko,version=0".

114479 17 annevk 2014-11-04 13:54:01 +0000
Joshua, Jungshik, Henri, is there interest in adding the labels mentioned in comment 16 to the replacement encoding? (With the risk that this might break pages that depend on fallback to the default encoding.)

If there is no active interest in getting this into browsers, I'm not sure if we should keep this open.

(Note that we have introduced a replacement encoding and disabled iso-2022-kr and hz-gb-2312 successfully, so those parts of comment 0 are addressed.)

114486 18 hsivonen 2014-11-04 14:32:32 +0000
(In reply to Anne from comment #17)
> Joshua, Jungshik, Henri, is there interest in adding the labels mentioned in
> comment 16 to the replacement encoding? (With the risk that this might break
> pages that depend on fallback to the default encoding.)

I'm still interested in this, because problem #2 from comment 0 hasn't been addressed yet. (Granted, it's a problem of insufficient clue on the part of a Web developer, but we do sometimes try to save people from themselves.)

I'm not going to have time to research the problem of this potentially breaking pages that expect fallback in the foreseeable future, though.

114498 19 zcorpan 2014-11-04 17:13:40 +0000
There might also be sites that instead rely on a later encoding declaration with a different label being picked up. e.g.

Content-Type: unknown
...
<meta charset=known>

122694 20 annevk 2015-08-21 07:39:13 +0000
Closing this in favor of https://github.com/whatwg/encoding/issues/8 since I'd like to stop using Bugzilla.
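The compatibility risk raised in comments 17 and 19 can be made concrete with a deliberately simplified Python sketch. It assumes a model in which the first declared label that the browser recognizes wins; real encoding determination in HTML has further steps (such as BOM handling) that are left out here. The function name `first_recognized` and both label tables are hypothetical.

```python
from typing import Iterable, Mapping, Optional


def first_recognized(
    labels: Iterable[Optional[str]], table: Mapping[str, str]
) -> Optional[str]:
    """Return the encoding for the first label found in the table.

    Labels are given in precedence order, e.g. the Content-Type label
    first and the <meta charset> label second. None means "not declared".
    """
    for label in labels:
        if label is None:
            continue
        encoding = table.get(label.strip().lower())
        if encoding is not None:
            return encoding
    return None  # caller would apply its default or fallback encoding


# Hypothetical label tables before and after adding an EBCDIC alias.
before = {"utf-8": "UTF-8"}
after = {**before, "cp037": "replacement"}

# A page that sends an unsupported label over HTTP and a supported one
# in <meta charset>.
declared = ["cp037", "utf-8"]
```

With the `before` table, `first_recognized(declared, before)` skips the unknown label and yields `"UTF-8"`; with the `after` table the same page resolves to `"replacement"` and renders as U+FFFD. That is the kind of breakage that would need to be researched before adding the labels from comment 16.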