This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019. Please see the home page for more details.
Problem statement:
1) The Encoding Standard removes the ISO-2022-CN encoding. This will make sites that rely on that encoding being supported vulnerable to XSS, the way Yahoo search was vulnerable in Chrome when Chrome removed ISO-2022-KR. See https://code.google.com/p/chromium/issues/detail?id=15701
2) There exist ASCII-incompatible encodings in the world outside the Encoding Standard, and support for those encodings might be exposed by server-side libraries. Sites that are naïve enough to allow the user to specify the output encoding that the site uses, and then pass the user-supplied encoding name to a server-side library without whitelisting ASCII-compatible encodings, are vulnerable to EBCDIC attacks: an attacker can request that the site use an EBCDIC-based encoding, and the site responds with EBCDIC, which isn't recognized by non-IE browsers. Those browsers fall back on an ASCII-compatible encoding, resulting in the EBCDIC bytes being interpreted in a dangerous way. See http://zaynar.co.uk/docs/charset-encoding-xss.html for a reference to an actual search engine that was vulnerable to this attack.
Proposed solution: Define a replacement encoding that decodes all possible byte values to the REPLACEMENT CHARACTER. Make the known labels for ASCII-incompatible encodings that exist but aren't part of the Encoding Standard aliases for the replacement encoding.
Additional info: This solution would pave the way for safe removal of ISO-2022-KR and hz-gb-2312 from the set of encodings supported by the Encoding Standard.
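To make the EBCDIC attack from item 2 concrete, here is a minimal Python sketch. It assumes a hypothetical site that HTML-escapes user input and then encodes the response in whatever encoding the request names. The cp037 codec stands in for "an EBCDIC-based encoding" and windows-1252 for the browser's fallback; both choices, and the payload, are illustrative assumptions rather than details taken from the linked report.

```python
import html

# Bytes the attacker wants the browser to see after it falls back to an
# ASCII-compatible encoding.
target_bytes = b"<script>alert(1)</script>"

# Hypothetical attacker input: the characters whose cp037 (EBCDIC) encoding
# is exactly target_bytes.
payload = target_bytes.decode("cp037")

# A typical escaping step finds no markup-significant characters to escape.
escaped = html.escape(payload)

# The site encodes its output in the attacker-chosen encoding.
response_body = escaped.encode("cp037")

# A browser that does not recognize the label falls back to an
# ASCII-compatible encoding (windows-1252 is assumed here).
seen_by_browser = response_body.decode("cp1252")
print(seen_by_browser)
```

The escaping step is bypassed because it runs on characters, while the browser ends up interpreting the encoded bytes under a different encoding than the one the server used.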
We should be conservative with this list, I suppose, as sites might rely on a fallback encoding being in play.
We should probably include these:
* iso-2022-cn
* iso-2022-cn-ext
Less sure about:
* EBCDIC labels
* utf-7
* utf-32
Hmm. Might want to allow 0x20 to decode as U+0020 to avoid accidentally DoSing layout.
(In reply to comment #1)
> Less sure about:
>
> * EBCDIC labels
To the extent IE currently recognizes these, at least in theory:
* Relying on falling back to ASCII doesn't work today in IE.
* IE would become less XSS-resilient if it dropped knowledge of those labels without aliasing them to a replacement encoding.
> * utf-7
> * utf-32
These might plausibly be relying on fallback currently.
Others to consider:
* CESU-8
* BOCU-1
* SCSU
We could emit a single U+FFFD and terminate I think. Pretend as if all bytes were consumed.
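The behaviour suggested in this comment can be sketched as follows. This is only an illustration of "emit a single U+FFFD and pretend all bytes were consumed", not text from the specification. The function name is hypothetical, and returning an empty string for empty input is an assumption that the comment does not address.

```python
REPLACEMENT_CHARACTER = "\ufffd"


def replacement_decode(data: bytes) -> str:
    """Sketch of the proposed decoder (hypothetical helper).

    Any non-empty input yields exactly one U+FFFD, regardless of its length
    or content. Empty input yields an empty string (an assumption).
    """
    return REPLACEMENT_CHARACTER if data else ""


print(repr(replacement_decode(b"<script>alert(1)</script>")))
```

Emitting one character instead of one per byte also sidesteps the layout concern raised earlier about very long runs of U+FFFD.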
Chrome implements a "fake" ISO-2022-CN decoder which always emits U+FFFD for all double-byte characters.
But I don't think the ISO-2022-CN problem is really exploitable in the real world. Gecko's ISO-2022-CN decoder has had an exploitable bug for a long time. I even described it in a public bug, but nobody cared. https://bugzilla.mozilla.org/show_bug.cgi?id=470523 So Gecko has completely ignored the ISO-2022-CN label since Firefox 19.
We also have to decide what to do for TextDecoder. And the encoder story for <form accept-charset>, script injecting a link into an iso-2022-kr <iframe>, and maybe more. I think the encoder story can be utf-8. Supporting it in TextDecoder does not seem problematic. TextEncoder is already prohibited.
I think we should also remove them from TextDecoder for consistency. If people really need to decode those encodings, they can implement the decoder using gbk/euc-kr decoders.
(In reply to comment #2)
> Others to consider:
> * CESU-8
> * BOCU-1
> * SCSU
I don't think we need to consider encodings that no browser has ever supported. If by any chance some pages relied on those encodings, they are already vulnerable.
What is the rationale for that? They might already be vulnerable, but would it not be better if they were less vulnerable going forward?
If the vulnerable page is actually present in the real world at all.
As a trial balloon: https://github.com/whatwg/encoding/commit/8329a2e768caea6908d600debd3cc8a6dc59c3c3 (I.e. not final, but gives us a thing to discuss.)
So going forward everything under EBCDIC in http://wiki.whatwg.org/wiki/Web_Encodings#Encodings_3 should be added. Let me know if you disagree. Then once implementations remove iso-2022-kr I will add that one too.
(In reply to comment #8)
> (In reply to comment #2)
> > Others to consider:
> > * CESU-8
> > * BOCU-1
> > * SCSU
>
> I don't think we need to consider about encodings no browsers have ever been
> supported. If by any chance some pages relied on those encodings, they are
> already vulnerable.
The threat scenario is that the server accepts an encoding name from a query string and passes it to a server-side library that implements encodings that browsers don't support.
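On the server side, the threat scenario above goes away if the requested name is checked against an allowlist before it reaches any encoding library. The following is a minimal sketch under stated assumptions: the function name, the set of accepted labels, and the UTF-8 default are all hypothetical choices for illustration.

```python
# Assumption: the site only needs these ASCII-compatible output encodings.
# Keys are accepted request labels; values are the names used for output.
ALLOWED_OUTPUT_ENCODINGS = {
    "utf-8": "utf-8",
    "utf8": "utf-8",
    "windows-1252": "windows-1252",
}
DEFAULT_OUTPUT_ENCODING = "utf-8"


def choose_output_encoding(requested: str) -> str:
    """Map a user-supplied label to an allowed encoding, else the default.

    The user-supplied string is never handed to a codec library directly.
    """
    label = requested.strip().lower()
    return ALLOWED_OUTPUT_ENCODINGS.get(label, DEFAULT_OUTPUT_ENCODING)


print(choose_output_encoding("cp037"))
```

Because unknown labels fall through to the default, exotic names that a server-side library might accept are never passed along.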
Per http://mxr.mozilla.org/mozilla-central/source/dom/encoding/labelsencodings.properties Gecko seems to match the specification. Do we want to add any of the other ones or should I resolve this as FIXED?
(In reply to Anne from comment #14)
> Do we want to add any of
> the other ones or should I resolve this as FIXED?
I think we should add
* BOCU-1
* SCSU
* Known EBCDIC labels.
...as labels of the replacement encoding in order to mitigate the attack described in http://zaynar.co.uk/docs/charset-encoding-xss.html .
If Google Translate works for http://masatokinugawa.l0.cm/2013/06/accounts.google.com-utf-32-xss.html , it appears that Google, who really should know better, allowed the output encoding to be controlled by the request URL.
UTF-7 is not on the list, because it's not dangerous to interpret UTF-7 as ASCII and there's some value in seeing the ASCII decoding of UTF-7 for Latin-script text.
UTF-32 is not on the list, because the BOM taking precedence and the little-endian UTF-32 sniffing as UTF-16LE would make aliasing to replacement a mere placebo. Furthermore, interpreting UTF-32 as non-UTF-32 doesn't appear to be dangerous when U+0000 is not discarded before tokenization, which it isn't in HTML.
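The UTF-32 and UTF-7 points can be illustrated with a short Python sketch. It only shows what the misinterpreted bytes look like for one sample string; it is not a proof that either case is safe in every context. The UTF-7 byte sequence is written out by hand as an assumption about what a UTF-7 producer would emit for "<" and ">".

```python
markup = "<script>"

# UTF-32LE bytes read as UTF-16LE: every character is followed by U+0000,
# so the original markup never appears as a contiguous run.
as_utf16 = markup.encode("utf-32-le").decode("utf-16-le")

# UTF-7 bytes for the same text, read as ASCII: the angle brackets stay
# hidden inside the "+...-" base64 runs.
utf7_bytes = b"+ADw-script+AD4-"
as_ascii = utf7_bytes.decode("ascii")

print(repr(as_utf16))
print(as_ascii)
```

In both cases the fallback interpretation does not contain the string "<script>", which matches the reasoning for leaving these two off the list.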
These are the known EBCDIC ones that IE supports per the WHATWG table including BOCU-1 and SCSU labels: * bocu-1 * ccsid00924 * ccsid01140 * ccsid01141 * ccsid01142 * ccsid01143 * ccsid01144 * ccsid01145 * ccsid01146 * ccsid01147 * ccsid01148 * ccsid01149 * cp00924 * cp01140 * cp01141 * cp01142 * cp01143 * cp01144 * cp01145 * cp01146 * cp01147 * cp01148 * cp01149 * cp037 * cp1025 * cp1026 * cp273 * cp278 * cp280 * cp284 * cp285 * cp290 * cp297 * cp420 * cp423 * cp424 * cp500 * cp870 * cp871 * cp875 * cp880 * cp905 * cp930 * cp933 * cp935 * cp937 * cp939 * csbocu-1 * csbocu1 * csibm037 * csibm1026 * csibm273 * csibm277 * csibm278 * csibm280 * csibm284 * csibm285 * csibm290 * csibm297 * csibm420 * csibm423 * csibm424 * csibm500 * csibm870 * csibm871 * csibm880 * csibm905 * csibmthai * csscsu * ebcdic-cp-ar1 * ebcdic-cp-be * ebcdic-cp-ca * ebcdic-cp-ch * ebcdic-cp-dk * ebcdic-cp-es * ebcdic-cp-fi * ebcdic-cp-fr * ebcdic-cp-gb * ebcdic-cp-gr * ebcdic-cp-he * ebcdic-cp-is * ebcdic-cp-it * ebcdic-cp-nl * ebcdic-cp-no * ebcdic-cp-roece * ebcdic-cp-se * ebcdic-cp-tr * ebcdic-cp-us * ebcdic-cp-wt * ebcdic-cp-yu * ebcdic-cyrillic * ebcdic-de-273+euro * ebcdic-dk-277+euro * ebcdic-es-284+euro * ebcdic-fi-278+euro * ebcdic-fr-297+euro * ebcdic-gb-285+euro * ebcdic-international-500+euro * ebcdic-is-871+euro * ebcdic-it-280+euro * ebcdic-jp-kana * ebcdic-latin9--euro * ebcdic-no-277+euro * ebcdic-se-278+euro * ebcdic-us-37+euro * ibm-thai * ibm00924 * ibm01047 * ibm01140 * ibm01141 * ibm01142 * ibm01143 * ibm01144 * ibm01145 * ibm01146 * ibm01147 * ibm01148 * ibm01149 * ibm037 * ibm1026 * ibm273 * ibm277 * ibm278 * ibm280 * ibm284 * ibm285 * ibm290 * ibm297 * ibm420 * ibm423 * ibm424 * ibm500 * ibm870 * ibm871 * ibm880 * ibm905 * scsu * x-cp21027 * x-ebcdic-japaneseanduscanada * x-ebcdic-koreanextended Of course on the server ICU might be used and which labels we want to ban from that is unclear to me. 
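Aliasing labels to the replacement encoding amounts to adding rows to a label lookup table. A minimal sketch, assuming a tiny excerpt of the labels discussed in this thread and a hypothetical `get_encoding` helper that trims ASCII whitespace and matches case-insensitively:

```python
# Excerpt only; the full proposed list is in the comment above.
REPLACEMENT_LABELS = {
    "iso-2022-cn", "iso-2022-cn-ext",
    "bocu-1", "scsu",
    "cp037", "ibm037", "ebcdic-cp-us",
}

# Hypothetical label table: a couple of ordinary labels plus the aliases.
LABEL_TO_ENCODING = {"utf-8": "utf-8", "utf8": "utf-8"}
LABEL_TO_ENCODING.update({label: "replacement" for label in REPLACEMENT_LABELS})

ASCII_WHITESPACE = "\t\n\f\r "


def get_encoding(label: str):
    """Return the encoding name for a label, or None if it is unknown."""
    return LABEL_TO_ENCODING.get(label.strip(ASCII_WHITESPACE).lower())


print(get_encoding(" IBM037 "))
```

An unknown label returns None, which is the case where a browser would move on to its other ways of determining the encoding. A label in the table prevents that fallback.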
ICU supports a lot of labels, including weird ones like "ISO_2022,locale=ko,version=0".
Joshua, Jungshik, Henri, is there interest in adding the labels mentioned in comment 16 to the replacement encoding? (With the risk that this might break pages that depend on fallback to the default encoding.) If there is no active interest in getting this into browsers, I'm not sure if we should keep this open. (Note that we have introduced a replacement encoding and disabled iso-2022-kr and hz-gb-2312 successfully, so those parts of comment 0 are addressed.)
(In reply to Anne from comment #17)
> Joshua, Jungshik, Henri, is there interest in adding the labels mentioned in
> comment 16 to the replacement encoding? (With the risk that this might break
> pages that depend on fallback to the default encoding.)
I'm still interested in this, because problem #2 from comment 0 hasn't been addressed yet. (Granted, it's a problem of insufficient clue on the part of a Web developer, but we do sometimes try to save people from themselves.) I'm not going to have time to research the problem of this potentially breaking pages that expect fallback in the foreseeable future, though.
There might also be sites that instead rely on a later encoding declaration with a different label being picked up, e.g.:
Content-Type: unknown
...
<meta charset=known>
Closing this in favor of https://github.com/whatwg/encoding/issues/8 since I'd like to stop using Bugzilla.