21057 – Introduce additional labels for the replacement encoding

This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019. Please see the home page for more details.

Bug 21057 - Introduce additional labels for the replacement encoding

Summary: Introduce additional labels for the replacement encoding

Status:	RESOLVED MOVED

Alias:	None

Product:	WHATWG
Classification:	Unclassified
Component:	Encoding
Version:	unspecified
Hardware:	PC Linux

Importance:	P2 normal
Target Milestone:	Unsorted
Assignee:	Anne
QA Contact:	sideshowbarker+encodingspec

URL:
Whiteboard:	blocked on implementer research
Keywords:

Depends on:
Blocks:

Reported:	2013-02-20 14:19 UTC by Henri Sivonen
Modified:	2015-08-21 07:39 UTC
CC List:	6 users

See Also:

Description Henri Sivonen 2013-02-20 14:19:27 UTC

Problem statement:

1) The Encoding Standard removes the ISO-2022-CN encoding. This will make sites that rely on that encoding being supported vulnerable to XSS the way Yahoo search was vulnerable in Chrome when Chrome removed ISO-2022-KR. See https://code.google.com/p/chromium/issues/detail?id=15701

2) There exist ASCII-incompatible encodings in the world outside the Encoding Standard, and support for those encodings might be exposed in server-side libraries. Sites that are naïve enough to let the user specify the output encoding that the site uses, and then pass the user-supplied encoding name to a server-side library without whitelisting ASCII-compatible encodings, are vulnerable to EBCDIC attacks: an attacker can request that the site use an EBCDIC-based encoding, and the site responds with EBCDIC, which isn't recognized by non-IE browsers. Those browsers fall back on an ASCII-compatible encoding, resulting in the EBCDIC bytes being interpreted in a dangerous way. See http://zaynar.co.uk/docs/charset-encoding-xss.html for a reference to an actual search engine that was vulnerable to this attack.
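
Editorial illustration of problem 2, as a minimal Python sketch rather than a reproduction of any real site. The handler `render_search_page` is hypothetical; `cp037` is Python's name for one EBCDIC code page, and `latin-1` stands in for whatever ASCII-compatible fallback a browser might use.

```python
import html


def render_search_page(query: str, charset: str) -> bytes:
    """Hypothetical vulnerable handler: escapes HTML, then trusts the requested charset."""
    body = "<p>Results for " + html.escape(query) + "</p>"
    return body.encode(charset)


# The attacker picks characters whose EBCDIC (cp037) bytes equal ASCII "<script>".
# None of those characters is <, >, &, " or ', so HTML escaping leaves them alone.
payload = b"<script>".decode("cp037")
escaping_changed_nothing = html.escape(payload) == payload

# The server encodes the page as EBCDIC because the request asked for it.
response = render_search_page(payload, "cp037")

# A browser that does not know the label falls back to an ASCII-compatible
# encoding (latin-1 here, as an assumption) and sees live markup.
as_seen_by_browser = response.decode("latin-1")
```

With a replacement-encoding alias for the EBCDIC label, the browser would not decode these bytes as ASCII-compatible text at all.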

Proposed solution:
Define a replacement encoding that decodes all possible byte values to the REPLACEMENT CHARACTER. Make the known labels for ASCII-incompatible encodings that exist but aren't part of the Encoding Standard aliases for the replacement encoding.

Additional info:
This solution would pave the way for safe removal of ISO-2022-KR and hz-gb-2312 from the set of encodings supported by the Encoding Standard.
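
Editorial illustration of the aliasing idea, as a hedged Python sketch. The table `LABELS` and the function `encoding_for_label` are hypothetical names, the table is deliberately tiny, and the whitespace-stripping and lowercasing step is an assumption about how labels are matched.

```python
from typing import Optional

REPLACEMENT = "replacement"

# Hypothetical, deliberately tiny label table. Labels of dangerous,
# unsupported encodings point at the replacement encoding instead of
# being unknown, so no ASCII-compatible fallback kicks in for them.
LABELS = {
    "utf-8": "utf-8",
    "euc-kr": "euc-kr",
    "iso-2022-cn": REPLACEMENT,
    "iso-2022-cn-ext": REPLACEMENT,
}


def encoding_for_label(label: str) -> Optional[str]:
    """Return the encoding name for a label, or None if the label is unknown."""
    key = label.strip("\t\n\x0c\r ").lower()
    return LABELS.get(key)
```

An unknown label returns `None` here, which corresponds to the case where a browser would fall back to its default encoding.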

Comment 1 Anne 2013-02-20 14:42:51 UTC

We should be conservative with this list I suppose as sites might rely on a fallback encoding being in play.

We should probably include these:

* iso-2022-cn
* iso-2022-cn-ext

Less sure about:

* EBCDIC labels
* utf-7
* utf-32

Comment 2 Henri Sivonen 2013-02-20 15:01:01 UTC

Hmm. Might want to allow 0x20 to decode as U+0020 to avoid accidentally DoSing layout.

(In reply to comment #1)
> Less sure about:
> 
> * EBCDIC labels

To the extent IE currently recognizes these, at least in theory:
 * Relying on falling back to ASCII doesn't work today in IE.
 * IE would become less XSS-resilient if it dropped knowledge of those labels without aliasing them to a replacement encoding.

> * utf-7
> * utf-32

These might plausibly be relying on fallback currently.

Others to consider:
 * CESU-8
 * BOCU-1
 * SCSU

Comment 3 Anne 2013-02-20 15:12:13 UTC

We could emit a single U+FFFD and terminate I think. Pretend as if all bytes were consumed.
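
Editorial illustration contrasting the two decoder behaviours discussed so far, as a minimal Python sketch. Both function names are hypothetical, and returning an empty string for empty input is an assumption not stated in the thread.

```python
REPLACEMENT_CHARACTER = "\ufffd"


def decode_per_byte(data: bytes) -> str:
    """Comment 0 variant: every byte decodes to U+FFFD."""
    return REPLACEMENT_CHARACTER * len(data)


def decode_single(data: bytes) -> str:
    """Comment 3 variant: emit one U+FFFD, then treat all bytes as consumed."""
    return REPLACEMENT_CHARACTER if data else ""
```

The single-character variant keeps the output size constant regardless of input length, which also sidesteps the layout concern raised in comment 2.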

Comment 4 Masatoshi Kimura 2013-02-20 19:20:09 UTC

Chrome implements a "fake" ISO-2022-CN decoder which always emits U+FFFD for all double-byte characters.

Comment 5 Masatoshi Kimura 2013-02-20 19:26:08 UTC

But I don't think the ISO-2022-CN problem is really exploitable in the real world.
Gecko's ISO-2022-CN decoder has had an exploitable bug for a long time. I even described it in a public bug, but nobody cared.
https://bugzilla.mozilla.org/show_bug.cgi?id=470523
So Gecko has completely ignored the ISO-2022-CN label since Firefox 19.

Comment 6 Anne 2013-02-22 10:55:57 UTC

We also have to decide what to do for TextDecoder. And the encoder story for <form accept-charset>, script injecting a link into an iso-2022-kr <iframe>, and maybe more.

I think the encoder story can be utf-8. Supporting it in TextDecoder does not seem problematic. TextEncoder is already prohibited.

Comment 7 Masatoshi Kimura 2013-02-22 19:11:52 UTC

I think we should also remove them from TextDecoder for consistency. If people really need to decode those encodings, they can implement the decoder using gbk/euc-kr decoders.

Comment 8 Masatoshi Kimura 2013-02-22 19:14:25 UTC

(In reply to comment #2)
> Others to consider:
>  * CESU-8
>  * BOCU-1
>  * SCSU

I don't think we need to consider encodings that no browser has ever supported. If by any chance some pages relied on those encodings, they are already vulnerable.

Comment 9 Anne 2013-02-22 19:34:42 UTC

What is the rationale for that? They might already be vulnerable, but would it not be better if they were less vulnerable going forward?

Comment 10 Masatoshi Kimura 2013-02-22 19:37:54 UTC

If the vulnerable page is actually present in the real world at all.

Comment 11 Anne 2013-02-22 19:41:52 UTC

As a trial balloon: https://github.com/whatwg/encoding/commit/8329a2e768caea6908d600debd3cc8a6dc59c3c3 (I.e. not final, but gives us a thing to discuss.)

Comment 12 Anne 2013-02-23 08:03:18 UTC

So going forward everything under EBCDIC in http://wiki.whatwg.org/wiki/Web_Encodings#Encodings_3 should be added. Let me know if you disagree.

Then once implementations remove iso-2022-kr I will add that one too.

Comment 13 Henri Sivonen 2013-02-25 11:43:53 UTC

(In reply to comment #8)
> (In reply to comment #2)
> > Others to consider:
> >  * CESU-8
> >  * BOCU-1
> >  * SCSU
> 
> I don't think we need to consider encodings that no browser has ever
> supported. If by any chance some pages relied on those encodings, they are
> already vulnerable.

The threat scenario is that the server accepts an encoding name from a query string and passes it to a server-side library that implements encodings that browsers don't support.
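
Editorial illustration of the server-side counterpart to this threat scenario, as a hedged Python sketch. The function `choose_output_encoding` and the contents of `ASCII_COMPATIBLE` are hypothetical; a real site would pick its own allowlist.

```python
# Hypothetical allowlist of ASCII-compatible output encodings (Python codec names).
ASCII_COMPATIBLE = frozenset({"utf-8", "windows-1252", "iso-8859-2"})


def choose_output_encoding(requested: str, default: str = "utf-8") -> str:
    """Honour a user-supplied encoding name only if it is on the allowlist."""
    name = requested.strip().lower()
    return name if name in ASCII_COMPATIBLE else default


def render(text: str, requested: str) -> bytes:
    """Encode a page with the vetted encoding instead of the raw request value."""
    return text.encode(choose_output_encoding(requested), errors="replace")
```

A request for `cp037`, `scsu`, or any other name outside the allowlist is served as UTF-8 here, so the server-side library never sees the attacker-chosen name.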

Comment 14 Anne 2013-12-02 14:17:38 UTC

Per http://mxr.mozilla.org/mozilla-central/source/dom/encoding/labelsencodings.properties Gecko seems to match the specification. Do we want to add any of the other ones or should I resolve this as FIXED?

Comment 15 Henri Sivonen 2013-12-04 10:15:55 UTC

(In reply to Anne from comment #14)
> Do we want to add any of
> the other ones or should I resolve this as FIXED?

I think we should add
 * BOCU-1
 * SCSU
 * Known EBCDIC labels.
...as labels of the replacement encoding in order to mitigate the attack described in http://zaynar.co.uk/docs/charset-encoding-xss.html . If Google Translate works for http://masatokinugawa.l0.cm/2013/06/accounts.google.com-utf-32-xss.html , it appears that Google, who really should know better, allowed the output encoding to be controlled by the request URL.

UTF-7 is not on the list, because it's not dangerous to interpret UTF-7 as ASCII and there's some value in seeing the ASCII decoding of UTF-7 for Latin-script text.

UTF-32 is not on the list, because the BOM taking precedence and the little-endian UTF-32 sniffing as UTF-16LE would make aliasing to replacement a mere placebo. Furthermore, interpreting UTF-32 as non-UTF-32 doesn't appear to be dangerous when U+0000 is not discarded before tokenization, which it isn't in HTML.
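
Editorial illustration of the two UTF-32 points above, as a small Python sketch. Decoding with `latin-1` is an assumption standing in for an ASCII-compatible fallback encoding.

```python
import codecs

# The UTF-32LE BOM (FF FE 00 00) starts with the UTF-16LE BOM (FF FE),
# which is why BOM sniffing can take such a stream for UTF-16LE.
bom_overlap = codecs.BOM_UTF32_LE.startswith(codecs.BOM_UTF16_LE)

# Little-endian UTF-32 markup read through an ASCII-compatible fallback:
# every character is followed by three U+0000 code points.
markup = "<script>".encode("utf-32-le")
as_fallback = markup.decode("latin-1")

# The tag only reappears if U+0000 is discarded before tokenization.
contains_tag = "<script>" in as_fallback
after_stripping_nul = as_fallback.replace("\x00", "")
```

As long as the U+0000 code points stay in the stream, the characters of the tag are never adjacent.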

Comment 16 Anne 2013-12-04 16:05:36 UTC

These are the known EBCDIC ones that IE supports per the WHATWG table including BOCU-1 and SCSU labels:

 * bocu-1
 * ccsid00924
 * ccsid01140
 * ccsid01141
 * ccsid01142
 * ccsid01143
 * ccsid01144
 * ccsid01145
 * ccsid01146
 * ccsid01147
 * ccsid01148
 * ccsid01149
 * cp00924
 * cp01140
 * cp01141
 * cp01142
 * cp01143
 * cp01144
 * cp01145
 * cp01146
 * cp01147
 * cp01148
 * cp01149
 * cp037
 * cp1025
 * cp1026
 * cp273
 * cp278
 * cp280
 * cp284
 * cp285
 * cp290
 * cp297
 * cp420
 * cp423
 * cp424
 * cp500
 * cp870
 * cp871
 * cp875
 * cp880
 * cp905
 * cp930
 * cp933
 * cp935
 * cp937
 * cp939
 * csbocu-1
 * csbocu1
 * csibm037
 * csibm1026
 * csibm273
 * csibm277
 * csibm278
 * csibm280
 * csibm284
 * csibm285
 * csibm290
 * csibm297
 * csibm420
 * csibm423
 * csibm424
 * csibm500
 * csibm870
 * csibm871
 * csibm880
 * csibm905
 * csibmthai
 * csscsu
 * ebcdic-cp-ar1
 * ebcdic-cp-be
 * ebcdic-cp-ca
 * ebcdic-cp-ch
 * ebcdic-cp-dk
 * ebcdic-cp-es
 * ebcdic-cp-fi
 * ebcdic-cp-fr
 * ebcdic-cp-gb
 * ebcdic-cp-gr
 * ebcdic-cp-he
 * ebcdic-cp-is
 * ebcdic-cp-it
 * ebcdic-cp-nl
 * ebcdic-cp-no
 * ebcdic-cp-roece
 * ebcdic-cp-se
 * ebcdic-cp-tr
 * ebcdic-cp-us
 * ebcdic-cp-wt
 * ebcdic-cp-yu
 * ebcdic-cyrillic
 * ebcdic-de-273+euro
 * ebcdic-dk-277+euro
 * ebcdic-es-284+euro
 * ebcdic-fi-278+euro
 * ebcdic-fr-297+euro
 * ebcdic-gb-285+euro
 * ebcdic-international-500+euro
 * ebcdic-is-871+euro
 * ebcdic-it-280+euro
 * ebcdic-jp-kana
 * ebcdic-latin9--euro
 * ebcdic-no-277+euro
 * ebcdic-se-278+euro
 * ebcdic-us-37+euro
 * ibm-thai
 * ibm00924
 * ibm01047
 * ibm01140
 * ibm01141
 * ibm01142
 * ibm01143
 * ibm01144
 * ibm01145
 * ibm01146
 * ibm01147
 * ibm01148
 * ibm01149
 * ibm037
 * ibm1026
 * ibm273
 * ibm277
 * ibm278
 * ibm280
 * ibm284
 * ibm285
 * ibm290
 * ibm297
 * ibm420
 * ibm423
 * ibm424
 * ibm500
 * ibm870
 * ibm871
 * ibm880
 * ibm905
 * scsu
 * x-cp21027
 * x-ebcdic-japaneseanduscanada
 * x-ebcdic-koreanextended

Of course on the server ICU might be used and which labels we want to ban from that is unclear to me. ICU supports a lot of labels, including weird ones like "ISO_2022,locale=ko,version=0".

Comment 17 Anne 2014-11-04 13:54:01 UTC

Joshua, Jungshik, Henri, is there interest in adding the labels mentioned in comment 16 to the replacement encoding? (With the risk that this might break pages that depend on fallback to the default encoding.)

If there is no active interest in getting this into browsers, I'm not sure if we should keep this open.

(Note that we have introduced a replacement encoding and disabled iso-2022-kr and hz-gb-2312 successfully, so those parts of comment 0 are addressed.)

Comment 18 Henri Sivonen 2014-11-04 14:32:32 UTC

(In reply to Anne from comment #17)
> Joshua, Jungshik, Henri, is there interest in adding the labels mentioned in
> comment 16 to the replacement encoding? (With the risk that this might break
> pages that depend on fallback to the default encoding.)

I'm still interested in this, because problem #2 from comment 0 hasn't been addressed yet. (Granted, it's a problem of insufficient clue on the part of a Web developer, but we do sometimes try to save people from themselves.)

I'm not going to have time to research the problem of this potentially breaking pages that expect fallback in the foreseeable future, though.

Comment 19 Simon Pieters 2014-11-04 17:13:40 UTC

There might also be sites that instead rely on a later encoding declaration with a different label being picked up. e.g.

Content-Type: unknown
...
<meta charset=known>

Comment 20 Anne 2015-08-21 07:39:13 UTC

Closing this in favor of https://github.com/whatwg/encoding/issues/8 since I'd like to stop using Bugzilla.