Bug 20886: Strengthen definition of "ASCII-compatible character encoding" to match §4.2.5.5
Status: RESOLVED NEEDSINFO
Product: WHATWG (Unclassified)
Component: HTML
Version: unspecified
Platform: Other
OS: other
Priority: P3
Severity: normal
Target milestone: Unsorted
URL: http://www.whatwg.org/specs/web-apps/current-work/#encoding-terminology
Reported: 2013-02-06 16:25:11 +0000
Last modified: 2013-05-31 20:13:05 +0000
Participants: contributor, ian, annevk, mike, VYV03354, zackw

Comment 0 (id 82627), contributor, 2013-02-06 16:25:11 +0000

Specification: http://www.whatwg.org/specs/web-apps/current-work/
Multipage: http://www.whatwg.org/C#encoding-terminology
Complete: http://www.whatwg.org/c#encoding-terminology

Comment:
I think the definition of "ASCII-compatible character encoding" should be strengthened to match the language at §4.2.5.5 about potentially dangerous encodings.

Posted from: 108.17.82.100
User agent: Mozilla/5.0 (X11; Linux x86_64; rv:18.0) Gecko/20100101 Firefox/18.0 Iceweasel/18.0.1

Comment 1 (id 82628), zackw, 2013-02-06 16:35:50 +0000

Elaboration: "ASCII-compatible character encoding" as defined in §2.1.6 is weak enough to permit character sets (such as HZ-GB-2312) that are specifically called out in §4.2.5.5 as unsafe due to risk of misinterpretation. I would like to suggest changing the definition to read

    An _ASCII-compatible character encoding_ is a single-byte or variable-length encoding in which the bytes 0x09, 0x0A, 0x0C, 0x0D, and 0x20 -- 0x7E always encode Unicode characters U+0009, U+000A, U+000C, U+000D, and U+0020 -- U+007E respectively.

If this is done, the note immediately after the definition should also be changed to read

    Note: This includes UTF-8 and most of the single-byte character encodings (such as all variants of ISO 8859) still in wide use. It excludes UTF-16 as well as some variable-length encodings (such as Shift_JIS, HZ-GB-2312, and variants of ISO-2022) in which it is possible for bytes in the 0x20--0x7E range to be part of longer sequences that are unrelated to their interpretation as ASCII.
    It also excludes various obsolete encodings such as UTF-7, GSM03.38, and EBCDIC.

Comment 2 (id 82652), ian, 2013-02-06 22:02:49 +0000

Can you construct a file with those encodings that lets user-generated content execute script because of tricking the UA into using the wrong encoding?

Comment 3 (id 82659), zackw, 2013-02-06 22:55:57 +0000

(In reply to comment #2)
> Can you construct a file with those encodings that lets user-generated
> content execute script because of tricking the UA into using the wrong
> encoding?

It is possible to construct a HZ-GB-2312-encoded HTML file that executes script if the UA does *not* support that encoding:

    <!doctype html>
    <html><head>
    <meta charset="hz-gb-2312">
    <title>script executed only if hz-gb-2312 unsupported</title>
    </head><body>
    ~{<script>alert("gotcha")</script>~}
    </body></html>

(The text between ~{ and ~} must not contain any spaces, as single bytes of the double-byte GB2312 encoding must have values between 0x21 and 0x7E inclusive. 0x78 through 0x7E inclusive cannot appear at the first, third, fifth, etc. byte positions between ~{ and ~} but this seems like only a minor hindrance to an exploit, despite { and } being in that range.)

Something similar is probably possible in Big5.

I have not, however, been able to find an exploit that goes the other way, i.e. script is executed only if the UA *does* support HZ-GB-2312 or Big5.

Comment 4 (id 82729), ian, 2013-02-08 02:29:10 +0000

In theory the set of encodings that are supported is now fixed, so I'm not particularly worried about attacks that rely on one vendor supporting one set of encodings and another not; the solution is just "supporting the standard set of encodings".

My concern is more with cases like a page that thinks it's in one encoding (e.g. ASCII or HZ-GB-2312) but attacker-inserted text is able to get interpreted as another encoding.
For example, in your case, if there was a way to trick the browser into not parsing the page as HZ-GB-2312 (other than the browser not supporting it) then that would be a concern.

Comment 5 (id 83283), zackw, 2013-02-18 18:23:38 +0000

I just thought of a social engineering attack: Write a page which declares itself as HZ-GB-2312 (for instance). Invite visitors to copy and paste what appears to be a string of nonsense hanzi onto a victim site (which is encoded in UTF-8 and allows user comments).

This is a bit of a stretch, since I think the clipboard will probably get recoded on paste (but I'm not sure of it) or on upload (but I'm not sure of that either), and even if it doesn't, a site that's not *expecting* HZ-GB-2312 ought to notice the smuggled <script> tag in its regular old anti-lulz filters.

Comment 6 (id 83314), annevk, 2013-02-19 10:57:28 +0000

I don't understand the attack in comment 5. Are you assuming the browser would not store the clipboard data as text (encoded as UTF-16 most likely) but rather as the raw bytes or some such?

Comment 7 (id 83326), zackw, 2013-02-19 14:33:05 +0000

Yes, I was imagining that the raw bytes would be transferred. (I honestly have no idea how clipboards work nowadays.)

Comment 8 (id 86057), ian, 2013-04-12 18:56:51 +0000

Clipboards copy the text as Unicode characters, so I don't think the attack in comment 5 would work. Any other attacks?
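
Illustration of the comment 3 example: the sketch below shows why the same bytes read differently to an HZ-aware decoder and to a parser that treats every byte as ASCII. It is a simplified scanner written for this thread, not any browser's decoder. The names `ascii_mode_text` and `payload` are hypothetical, and the scanner assumes only the basic HZ escapes (`~{`, `~}`, `~~`, and `~` followed by a newline); it does not check whether the double-byte pairs are valid GB2312 characters.

```python
def ascii_mode_text(data: bytes) -> str:
    """Return only the bytes a (simplified) HZ-aware decoder treats as ASCII.

    Assumption: bytes between ~{ and ~} are consumed two at a time as
    double-byte characters and are dropped here rather than decoded.
    """
    out = bytearray()
    in_gb = False  # True while inside a ~{ ... ~} double-byte run
    i = 0
    while i < len(data):
        pair = data[i:i + 2]
        if in_gb:
            if pair == b"~}":
                in_gb = False  # escape back to ASCII mode
            # Otherwise the pair is one double-byte character; skip it.
            i += 2
        elif pair == b"~{":
            in_gb = True       # escape into double-byte mode
            i += 2
        elif pair == b"~~":
            out += b"~"        # escaped literal tilde
            i += 2
        elif pair == b"~\n":
            i += 2             # line continuation, produces no output
        else:
            out.append(data[i])
            i += 1
    return out.decode("ascii", errors="replace")


# The body of the test page from comment 3.
payload = b'<body>~{<script>alert("gotcha")</script>~}</body>'

# A parser that assumes plain ASCII sees the script element as markup.
naive_view = payload.decode("ascii")

# An HZ-aware reading treats everything between ~{ and ~} as double-byte text.
hz_view = ascii_mode_text(payload)

print("<script>" in naive_view)  # True
print(hz_view)                   # <body></body>
```

Under these assumptions the `<script>` bytes are markup only when the `~{` escape is not honored, which is the situation comment 3 describes for a UA that does not support HZ-GB-2312.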