ISSUE-19: How are non-ASCII characters handled in CSP
Interaction of CSP and IRIs
- State: CLOSED
- Product: CSP Level 1
- Raised by: Brad Hill
- Opened on: 2012-09-11
- Description:
- Last Call comment from Boris Zbarsky <bzbarsky@MIT.EDU>:
Dear all,
I was just reading through the CSP draft, and I'm very concerned by the handling of non-ASCII characters in CSP. Specifically, I'm concerned about four things:
A) Lack of description for how one goes from an IRI or partial IRI to a host-source expression.
B) Lack of description for how one compares a source expression to an IRI.
C) Lack of description for how one goes from a Unicode string to policy.
D) The fact that the current setup is likely to cause interop problems.
As far as I can tell, the current setup is as follows:
1) All CSP policies are made up of bytes in the ASCII range (and in particular, a subset of that range). Non-ASCII hostnames are expected to be encoded as punycode, I guess (though this is not actually stated anywhere; see concern A above). Non-ASCII characters in paths are presumably expected to be %-encoded, but the specification doesn't say what encoding should be used for this (concern A again). In practice, by the way, at least one implementation allows non-ASCII bytes in paths, though I think the spec is pretty clear that as things stand this is not allowed.
2) When comparing a source expression to an IRI, the IRI needs to first be converted to a URI, presumably per RFC 3987. If the presumption is correct, this should probably be explicitly called out (concern B above).
3) When converting a Unicode string to a policy, presumably one does it by taking the numeric value of each codepoint and treating it as an ASCII character index? If so, this should be explicitly called out (concern C above).
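As an illustration of the conversion that points 1 and 2 presume, here is a minimal Python sketch of mapping an IRI to an ASCII-only URI. It is not taken from the CSP draft: the function name `iri_to_uri` is hypothetical, the host is converted with Python's built-in `idna` codec (which implements IDNA2003), non-ASCII path and query characters are assumed to be %-encoded as UTF-8, and userinfo and fragments are ignored for brevity. It is a simplification, not a complete RFC 3987 implementation.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def iri_to_uri(iri: str) -> str:
    """Map an IRI to an ASCII-only URI (simplified sketch, see assumptions above)."""
    parts = urlsplit(iri)
    # Host: Python's built-in "idna" codec produces the punycode ("xn--") form.
    host = parts.hostname.encode("idna").decode("ascii") if parts.hostname else ""
    netloc = host if parts.port is None else f"{host}:{parts.port}"
    # Path and query: %-encode non-ASCII characters as UTF-8 bytes (an assumption),
    # leaving existing %-escapes and reserved characters untouched.
    path = quote(parts.path, safe="/%:@!$&'()*+,;=~", encoding="utf-8")
    query = quote(parts.query, safe="/%:@!$&'()*+,;=?~", encoding="utf-8")
    # Userinfo and fragment are dropped in this sketch.
    return urlunsplit((parts.scheme, netloc, path, query, ""))


example_uri = iri_to_uri("http://bücher.example/päth")
print(example_uri)  # http://xn--bcher-kva.example/p%C3%A4th
```

Under these assumptions, a source expression would be compared against the converted URI rather than against the original IRI.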
In practice, I expect people to just call their favorite escape() method on their strings if they have to shoehorn them into an ASCII format, which means that we'll get a mix of %-encoding as ISO-8859-1 and UTF-8 at the very least, and very possibly others. The result will be lack of interop (concern D).
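To make concern D concrete, the following Python sketch %-encodes the same path under the two encodings named above. The path `/café` is an arbitrary example, and the exact string comparison at the end is an assumption about how a naive matcher might behave, not the matching algorithm from the CSP draft.

```python
from urllib.parse import quote

path = "/café"

# The same Unicode path escaped under two different byte encodings.
as_utf8 = quote(path, encoding="utf-8")
as_latin1 = quote(path, encoding="iso-8859-1")

print(as_utf8)    # /caf%C3%A9
print(as_latin1)  # /caf%E9

# A policy written with one escaping does not match a URI produced with the other
# when the two are compared as plain ASCII strings.
paths_match = as_utf8 == as_latin1
print(paths_match)  # False
```

Both results are valid ASCII, so neither looks wrong in isolation. The mismatch only shows up when the policy is compared against a URI that was escaped differently.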
It seems to me that a lot of these problems would be alleviated if CSP policies were defined as sequences of Unicode codepoints, with a comparison function to IRIs. The spec would also need to define how to construct such a sequence of Unicode codepoints from a Content-Security-Policy HTTP header or a Content-Security-Policy-Report-Only HTTP header, but the result would be to allow authors to use strings that actually make sense to them in CSP policies instead of shoehorning them into an ASCII-only format in likely-broken ways.
Thank you for taking the time to read all that, Boris
- Related Action Items: No related actions
- Related emails: No related emails
Related notes:
On 9/6/12 11:48 PM, Adam Barth wrote:
> HTTP operates in terms of URIs.
Yes, but very few authors actually write HTTP servers.
> I'm not sure I understand your question. Authors deal with
> host-expressions the same way they deal with the HTTP Host header.
Authors generally don't have to author Host headers; the UA sends those.
They will, however, need to author host-expressions to actually use CSP.
>> Why not? Everything else a browser has lying around (e.g. document
>> locations) is IRIs. Are host-source expressions never compared to
>> document locations?
>
> In the end, the browser needs to translate IRIs into URIs for use in
> HTTP. Everything in CSP 1.0 is defined in terms of networking
> operations
OK, fair.
> Indeed, but that's outside the scope of CSP 1.0.
Yes, I understand that's your position. I just wish there were a way to make this stuff less of a footgun for authors...
> Actually, if your issue is with the WebKit implementation, you can
> just file a bug and I'll write a test in the course of fixing it.
https://bugs.webkit.org/show_bug.cgi?id=96061
Note that I haven't looked through the Gecko version carefully (because regexps); it may have similar problems.
> The short version is that the IETF insists that folks use IDNA2008,
> but most browsers implement something closer to IDNA2003. IDNA2008 is
> not backwards compatible with IDNA2003 and so will never actually be
> deployed. Any attempts to hammer out a browser-consensus spec get
> shouted down by folks who are pushing IDNA2008.
I see. <sigh>.
-Boris
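The incompatibility between IDNA2003 and IDNA2008 mentioned in the quoted text can be seen with a single hostname. The sketch below is illustrative only: the hostname `straße.example` is an arbitrary example, Python's built-in `idna` codec implements IDNA2003, and the second value is a hand-built approximation of the IDNA2008 result (which keeps "ß" instead of mapping it to "ss"), not the output of a real IDNA2008 library.

```python
hostname = "straße.example"

# IDNA2003 (Python's built-in codec): nameprep maps "ß" to "ss" before encoding.
idna2003 = hostname.encode("idna").decode("ascii")

# Approximation of IDNA2008: "ß" is kept, so the label is punycode-encoded as-is.
# This is hand-rolled per label and skips all IDNA2008 validity checks.
labels = hostname.split(".")
idna2008_approx = ".".join(
    label if label.isascii() else "xn--" + label.encode("punycode").decode("ascii")
    for label in labels
)

print(idna2003)         # strasse.example
print(idna2008_approx)  # an "xn--" label, i.e. a different hostname

same_host = idna2003 == idna2008_approx
```

The two algorithms yield different ASCII hostnames for the same input, so a host-source expression written under one would not match a request resolved under the other.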
Responses to this issue can be found in the following threads (there are often several replies, so it is suggested to view "Contemporary messages sorted by thread"):
http://lists.w3.org/Archives/Public/public-webappsec/2012Oct/0008.html
http://lists.w3.org/Archives/Public/public-webappsec/2012Oct/0025.html
The group's decision to close this issue without changing spec behavior was recorded in the minutes to the following teleconferences:
http://www.w3.org/2011/webappsec/minutes/webappsec-minutes-25-Sep-2012.html
http://www.w3.org/2011/webappsec/minutes/webappsec-minutes-23-Oct-2012.html