20886 – Strengthen definition of "ASCII-compatible character encoding" to match §4.2.5.5

This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019.

Status:	RESOLVED NEEDSINFO

Alias:	None

Product:	WHATWG
Classification:	Unclassified
Component:	HTML
Version:	unspecified
Hardware:	Other other

Importance:	P3 normal
Target Milestone:	Unsorted
Assignee:	Ian 'Hixie' Hickson
QA Contact:	contributor

URL:	http://www.whatwg.org/specs/web-apps/...

Reported:	2013-02-06 16:25 UTC by contributor
Modified:	2013-05-31 20:13 UTC
CC List:	5 users

Description contributor 2013-02-06 16:25:11 UTC

Specification: http://www.whatwg.org/specs/web-apps/current-work/
Multipage: http://www.whatwg.org/C#encoding-terminology
Complete: http://www.whatwg.org/c#encoding-terminology

Comment:
I think the definition of "ASCII-compatible character encoding" should be
strengthened to match the language at §4.2.5.5 about potentially dangerous
encodings.

Posted from: 108.17.82.100
User agent: Mozilla/5.0 (X11; Linux x86_64; rv:18.0) Gecko/20100101 Firefox/18.0 Iceweasel/18.0.1

Comment 1 Zack Weinberg 2013-02-06 16:35:50 UTC

Elaboration: "ASCII-compatible character encoding" as defined in §2.1.6 is weak enough to permit character sets (such as HZ-GB-2312) that are specifically called out in §4.2.5.5 as unsafe due to risk of misinterpretation.  I would like to suggest changing the definition to read

    An _ASCII-compatible character encoding_ is a single-byte or variable-length encoding in which the bytes 0x09, 0x0A, 0x0C, 0x0D, and 0x20 -- 0x7E always encode Unicode characters U+0009, U+000A, U+000C, U+000D, and U+0020 -- U+007E respectively.

If this is done, the note immediately after the definition should also be changed to read

    Note: This includes UTF-8 and most of the single-byte character encodings (such as all variants of ISO 8859) still in wide use.  It excludes UTF-16 as well as some variable-length encodings (such as Shift_JIS, HZ-GB-2312, and variants of ISO-2022) in which it is possible for bytes in the 0x20--0x7E range to be part of longer sequences that are unrelated to their interpretation as ASCII.  It also excludes various obsolete encodings such as UTF-7, GSM03.38, and EBCDIC.
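The single-byte half of the proposed definition can be checked mechanically. The following is a minimal sketch, assuming Python's built-in codecs; the function name `passes_single_byte_check` is hypothetical. It tests only a necessary condition: it cannot detect encodings in which these bytes also occur inside longer sequences, which is the case the proposed wording ("always encode") is meant to exclude.

```python
# Bytes that the proposed definition requires to keep their ASCII meaning:
# 0x09, 0x0A, 0x0C, 0x0D, and 0x20 through 0x7E.
REQUIRED_BYTES = [0x09, 0x0A, 0x0C, 0x0D] + list(range(0x20, 0x7F))


def passes_single_byte_check(codec_name: str) -> bool:
    """Return True if every required byte, decoded on its own, yields the
    Unicode character with the same code point.

    This is a necessary condition only; it says nothing about how the
    same bytes are treated inside multi-byte sequences.
    """
    for b in REQUIRED_BYTES:
        try:
            decoded = bytes([b]).decode(codec_name)
        except UnicodeDecodeError:
            # The byte is not a complete sequence in this encoding.
            return False
        if decoded != chr(b):
            return False
    return True


if __name__ == "__main__":
    for name in ("utf-8", "iso8859-1", "utf-16", "cp037"):
        print(name, passes_single_byte_check(name))
```

Here `cp037` stands in for EBCDIC. A stateful or multi-byte encoding would additionally need a check on whole byte sequences, which this sketch does not attempt.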

Comment 2 Ian 'Hixie' Hickson 2013-02-06 22:02:49 UTC

Can you construct a file with those encodings that lets user-generated content execute script because of tricking the UA into using the wrong encoding?

Comment 3 Zack Weinberg 2013-02-06 22:55:57 UTC

(In reply to comment #2)
> Can you construct a file with those encodings that lets user-generated
> content execute script because of tricking the UA into using the wrong
> encoding?

It is possible to construct an HZ-GB-2312-encoded HTML file that executes script if the UA does *not* support that encoding:

  <!doctype html>
  <html><head>
    <meta charset="hz-gb-2312">
    <title>script executed only if hz-gb-2312 unsupported</title>
  </head><body>
  ~{<script>alert("gotcha")</script>~}
  </body></html>

(The text between ~{ and ~} must not contain any spaces, as each byte of a double-byte GB2312 character must have a value between 0x21 and 0x7E inclusive.  Bytes 0x78 through 0x7E inclusive cannot appear at the first, third, fifth, etc. byte positions between ~{ and ~}, but this seems like only a minor hindrance to an exploit, even though { and } fall in that range.)

Something similar is probably possible in Big5.  I have not, however, been able to find an exploit that goes the other way, i.e. script is executed only if the UA *does* support HZ-GB-2312 or Big5.
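The byte constraints described in the parenthetical above can be sketched as a small check. This is an illustration only: the function name `fits_hz_gb_mode` is hypothetical, the even-length requirement is an assumption drawn from "double-byte", and the check does not verify that each byte pair maps to an assigned GB2312 character.

```python
def fits_hz_gb_mode(payload: bytes) -> bool:
    """Check the structural constraints stated in comment 3 for text
    placed between ~{ and ~}.

    - assumed: the payload consists of whole two-byte units
    - every byte lies in 0x21..0x7E (so no spaces)
    - bytes at the first, third, fifth, ... positions are below 0x78
    """
    if len(payload) % 2 != 0:
        return False
    for index, byte in enumerate(payload):
        if not 0x21 <= byte <= 0x7E:
            return False
        # index 0, 2, 4, ... are the lead bytes of each pair
        if index % 2 == 0 and byte >= 0x78:
            return False
    return True


# The script from the example file above.
SCRIPT = b'<script>alert("gotcha")</script>'

if __name__ == "__main__":
    print(fits_hz_gb_mode(SCRIPT))
    print(fits_hz_gb_mode(b'<script>alert("got cha")</script>'))
```

Under these constraints the example script fits, while a variant containing a space does not.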

Comment 4 Ian 'Hixie' Hickson 2013-02-08 02:29:10 UTC

In theory the set of encodings that are supported is now fixed, so I'm not particularly worried about attacks that rely on one vendor supporting one set of encodings and another not — the solution is just "supporting the standard set of encodings". My concern is more with cases like a page that thinks it's in one encoding (e.g. ASCII or HZ-GB-2312) but attacker-inserted text is able to get interpreted as another encoding. For example, in your case, if there was a way to trick the browser into not parsing the page as HZ-GB-2312 (other than the browser not supporting it) then that would be a concern.

Comment 5 Zack Weinberg 2013-02-18 18:23:38 UTC

I just thought of a social engineering attack: Write a page which declares itself as HZ-GB-2312 (for instance).  Invite visitors to copy and paste what appears to be a string of nonsense hanzi onto a victim site (which is encoded in UTF-8 and allows user comments).  This is a bit of a stretch: I think the clipboard will probably get recoded on paste or on upload (but I'm not sure of either), and even if it doesn't, a site that's not *expecting* HZ-GB-2312 ought to notice the smuggled <script> tag in its regular old anti-lulz filters.

Comment 6 Anne 2013-02-19 10:57:28 UTC

I don't understand the attack in comment 5. Are you assuming the browser would not store the clipboard data as text (most likely encoded as UTF-16) but rather as the raw bytes or some such?

Comment 7 Zack Weinberg 2013-02-19 14:33:05 UTC

Yes, I was imagining that the raw bytes would be transferred.  (I honestly have no idea how clipboards work nowadays.)

Comment 8 Ian 'Hixie' Hickson 2013-04-12 18:56:51 UTC

Clipboards copy the text as Unicode characters, so I don't think the attack in comment 5 would work.

Any other attacks?
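As an illustration of the clipboard point in comment 8: once the pasted text is held as Unicode characters, submitting it to a UTF-8 page re-encodes each hanzi into bytes that are all 0x80 or above, so no byte can be read as "<" or any other ASCII character. A minimal sketch, with an arbitrary sample string and a hypothetical helper name:

```python
def utf8_has_ascii_bytes(text: str) -> bool:
    """Return True if the UTF-8 encoding of text contains any byte
    below 0x80, i.e. any byte that could be read as ASCII."""
    return any(byte < 0x80 for byte in text.encode("utf-8"))


# Arbitrary hanzi chosen for illustration, not taken from the bug report.
SAMPLE = "\u4e2d\u6587\u5b57"

if __name__ == "__main__":
    print(utf8_has_ascii_bytes(SAMPLE))
    print(utf8_has_ascii_bytes("<script>"))
```

This follows from UTF-8's design: bytes below 0x80 only ever encode the characters U+0000 through U+007F.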