19961 – Write security considerations

This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019.

Bug 19961 - Write security considerations

Summary: Write security considerations

Status:	RESOLVED FIXED

Alias:	None

Product:	WHATWG
Classification:	Unclassified
Component:	Encoding
Version:	unspecified
Hardware:	PC Windows 3.1

Importance:	P2 normal
Target Milestone:	Unsorted
Assignee:	Anne
QA Contact:	sideshowbarker+encodingspec

URL:
Whiteboard:
Keywords:

Depends on:
Blocks:

Reported:	2012-11-14 14:09 UTC by Anne
Modified:	2014-12-06 21:32 UTC
CC List:	4 users

See Also:

Attachments

Description Anne 2012-11-14 14:09:08 UTC

https://bugzilla.mozilla.org/show_bug.cgi?id=406777#c1
http://zaynar.co.uk/docs/charset-encoding-xss.html
https://bugzilla.mozilla.org/show_bug.cgi?id=690225

Comment 1 Masatoshi Kimura 2012-12-04 11:27:15 UTC

Why was the platform changed to "Windows 3.1" for all bugs?

Comment 2 Anne 2014-04-11 11:22:00 UTC

The encoder's HTML error mode has the potential for silent data loss when utf-8 is not used.

Comment 3 Anne 2014-11-09 17:22:18 UTC

I wrote a draft for this section. Review appreciated.

===
There is a set of security problems related to encodings that arise when the producer and consumer do not agree on the encoding in use, or on the way a given encoding is to be implemented. For instance, an attack was reported in 2011 where a <span>shift_jis</span> lead byte 0x82 was used to “mask” a 0x22 trail byte in a JSON resource in which an attacker could control some field. The producer did not see the problem even though this is an illegal byte combination. The consumer decoded it as a single U+FFFD and therefore changed the overall interpretation, as U+0022 is an important delimiter. Decoders of encodings that use multiple bytes for scalar values now require that, in case of an illegal byte combination, a scalar value in the U+0000 to U+007F range cannot be “masked”. For the aforementioned sequence the output would be U+FFFD U+0022.
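The masking behaviour described above can be illustrated with a toy decoder. This is only a sketch, not the shift_jis decoder from the Encoding Standard: it has no index table, so every valid two-byte sequence and every other non-ASCII byte is represented by a placeholder character. The names `decode_sketch`, `mask_ascii`, and `PLACEHOLDER` are hypothetical and exist only for this illustration. The lead and trail byte ranges are the ones the shift_jis decoder uses.

```python
REPLACEMENT = "\uFFFD"
PLACEHOLDER = "\u3013"  # hypothetical stand-in for any decoded non-ASCII character


def is_lead(byte: int) -> bool:
    # shift_jis lead byte ranges.
    return 0x81 <= byte <= 0x9F or 0xE0 <= byte <= 0xFC


def is_trail(byte: int) -> bool:
    # shift_jis trail byte ranges; note that 0x22 is not among them.
    return 0x40 <= byte <= 0x7E or 0x80 <= byte <= 0xFC


def decode_sketch(data: bytes, mask_ascii: bool) -> str:
    """Toy decoder: only the error handling around lead bytes is modelled."""
    out = []
    i = 0
    while i < len(data):
        byte = data[i]
        if byte <= 0x7F:
            out.append(chr(byte))
            i += 1
        elif is_lead(byte):
            if i + 1 >= len(data):
                # Lead byte at end of input: error.
                out.append(REPLACEMENT)
                i += 1
            elif is_trail(data[i + 1]):
                # A real decoder would consult the index table here.
                out.append(PLACEHOLDER)
                i += 2
            elif data[i + 1] <= 0x7F and not mask_ascii:
                # Safe behaviour: emit U+FFFD for the lead byte only and
                # let the ASCII byte be decoded on the next iteration.
                out.append(REPLACEMENT)
                i += 1
            else:
                # Problematic behaviour: one U+FFFD swallows both bytes.
                out.append(REPLACEMENT)
                i += 2
        else:
            # Other single bytes (for example half-width katakana).
            out.append(PLACEHOLDER)
            i += 1
    return "".join(out)


# Attacker-controlled field value ends in the lead byte 0x82.
payload = b'{"a":"\x82","b":"x"}'
print(decode_sketch(payload, mask_ascii=True))
print(decode_sketch(payload, mask_ascii=False))
```

With `mask_ascii=True` the closing quote of the first field disappears, so the string boundary moves; with `mask_ascii=False` the quote survives as U+0022 after the U+FFFD.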

This is an even bigger problem with encodings that map anything in the 0x00 to 0x7F range to something other than U+0000 to U+007F, when there is no lead byte present. These are “ASCII-incompatible” encodings and other than <span>iso-2022-jp</span>, <span>utf-16be</span>, and <span>utf-16le</span>, which are unfortunately required by legacy content, they are not supported. (Investigation is <a href="https://www.w3.org/Bugs/Public/show_bug.cgi?id=21057" title="Introduce additional labels for the replacement encoding">ongoing</a> into whether more labels of these encodings can be mapped to the <span>replacement</span> encoding.) An attack here can consist of injecting carefully crafted content into a resource and then encouraging the user to override the encoding, resulting in script execution. Browsers are strongly encouraged to disable character encoding overrides for resources using one of the aforementioned problematic encodings.
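As a rough illustration of why ASCII-incompatible encodings are risky, the following sketch decodes the same bytes twice using Python's built-in codecs. It is not a reproduction of any reported attack; it only shows that a component reading the bytes as utf-16be sees no markup at all, while a component reading them with an ASCII-compatible encoding sees a script tag. `cp1252` is Python's codec name, used here as an approximation of windows-1252.

```python
raw = b"<script>"

# An ASCII-compatible encoding maps 0x3C to U+003C ("<").
as_ascii_compatible = raw.decode("cp1252")

# utf-16be pairs the bytes up, so 0x3C 0x73 becomes the single code point U+3C73.
as_utf16be = raw.decode("utf-16-be")

print(as_ascii_compatible)
print([f"U+{ord(ch):04X}" for ch in as_utf16be])
```

Whichever side inspects the content for "<" and whichever side finally interprets it need to use the same decoding, otherwise a check on one side says nothing about what the other side sees.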

<hr>

Encoders used by URLs found in HTML and HTML's form feature can also result in slight information loss when an encoding is used that cannot represent all scalar values. E.g. when a resource uses the <span>windows-1252</span> encoding a server will not be able to distinguish between an end user entering “

Comment 4 Anne 2014-11-10 14:27:41 UTC

Here is the last bit starting at the <hr> above as it was cut off due to using a non-BMP code point (replaced with U+... reference):

===
Encoders used by URLs found in HTML and HTML's form feature can also result in slight information loss when an encoding is used that cannot represent all scalar values. E.g. when a resource uses the <span>windows-1252</span> encoding a server will not be able to distinguish between an end user entering U+1F4A9 and “&amp;#128169;” into a form.

<hr>

The problems outlined here go away when exclusively using utf-8, which is one of the many reasons it is now the mandatory encoding for all things.
===
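The information loss described in the draft above can be sketched with Python's `xmlcharrefreplace` error handler. This is an assumption-laden stand-in: Python's handler is not the Encoding Standard's HTML error mode, and `cp1252` is Python's approximation of windows-1252, but for this input both replace an unencodable code point with a decimal numeric character reference.

```python
emoji = "\U0001F4A9"           # the user typed the character itself
typed_reference = "&#128169;"  # the user literally typed these nine ASCII characters

encoded_emoji = emoji.encode("cp1252", errors="xmlcharrefreplace")
encoded_reference = typed_reference.encode("cp1252", errors="xmlcharrefreplace")

# After encoding, the two inputs produce identical bytes.
print(encoded_emoji == encoded_reference)

# With utf-8 the two inputs remain distinguishable.
print(emoji.encode("utf-8") == typed_reference.encode("utf-8"))
```

A server receiving the windows-1252 bytes has no way to tell which of the two inputs the user entered.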

Comment 5 Henri Sivonen 2014-12-05 08:09:59 UTC

"Browsers are strongly encouraged to disable character encoding overrides for resources using one of the aforementioned problematic encodings."

Please clarify that browsers should both:
 1) Not offer UTF-16 as a manual override.
 2) Ignore manual overrides for resources that are UTF-16 to begin with.
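The two requested behaviours could be sketched as follows. This is a hypothetical model, not browser code: `offered_overrides`, `effective_encoding`, and `UTF16_FAMILY` are names invented for this illustration, and encodings are assumed to be identified by their lowercase names.

```python
from typing import List, Optional

# Assumption: the UTF-16 encodings are identified by these names.
UTF16_FAMILY = {"utf-16be", "utf-16le"}


def offered_overrides(candidates: List[str]) -> List[str]:
    """Point 1: never list a UTF-16 encoding in the override menu."""
    return [name for name in candidates if name.lower() not in UTF16_FAMILY]


def effective_encoding(resource_encoding: str, manual_override: Optional[str]) -> str:
    """Point 2: a resource that is UTF-16 to begin with ignores overrides."""
    if resource_encoding.lower() in UTF16_FAMILY:
        return resource_encoding
    # Defensive: also refuse an override *to* UTF-16 if one arrives anyway.
    if manual_override is None or manual_override.lower() in UTF16_FAMILY:
        return resource_encoding
    return manual_override


print(offered_overrides(["utf-8", "utf-16le", "windows-1252"]))
print(effective_encoding("utf-16le", "windows-1252"))
print(effective_encoding("windows-1252", "utf-8"))
```

Whether ISO-2022-JP belongs in the same set is left open here, as it is in the comment.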

I'm unsure if the above should apply to ISO-2022-JP. I haven't seen a PoC of an attack either way, and Firefox currently allows override both to and from ISO-2022-JP.

Comment 6 Anne 2014-12-06 21:32:16 UTC

Done.

https://github.com/whatwg/encoding/commit/2e43ead5c796e314cd3aaada10a2dc33de7bfaf1