This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019. Please see the home page for more details.

Bug 24257 - "Percent-decoding + full-width characters + percent decoding" for domains is missing
Status: RESOLVED WONTFIX
Alias: None
Product: WHATWG
Classification: Unclassified
Component: URL
Version: unspecified
Hardware: PC Linux
Importance: P2 normal
Target Milestone: Unsorted
Assignee: Anne
QA Contact: sideshowbarker+urlspec
Blocks: 24187
 
Reported: 2014-01-09 18:59 UTC by Santiago M. Mola
Modified: 2014-01-13 18:17 UTC
CC List: 3 users

See Also:


Attachments

Description Santiago M. Mola 2014-01-09 18:59:13 UTC
(Sorry for the title, I really couldn't find anything more descriptive for this case).

- \uff05\uff14\uff11.com (%41 in Unicode full-width characters) -> %41.com -> A.com -> a.com
- %ef%bc%85%ef%bc%94%ef%bc%91.com (the former case... percent-encoded) -> \uff05\uff14\uff11.com -> %41.com -> A.com -> a.com
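The two chains above can be traced step by step. This is only a sketch: it uses NFKC normalization as a stand-in for whatever mapping the browsers apply to full-width characters, and the helper names `fold_fullwidth` and `trace` are made up for illustration.

```python
import unicodedata
from urllib.parse import unquote

def fold_fullwidth(text: str) -> str:
    """Map compatibility characters (e.g. U+FF05) to their plain forms via NFKC."""
    return unicodedata.normalize("NFKC", text)

def trace(host: str) -> list:
    """Return every intermediate form, starting with the input itself."""
    steps = [host]
    if "%" in host:                          # ASCII percent signs: second case only
        steps.append(unquote(host))          # UTF-8 percent-decoding
    steps.append(fold_fullwidth(steps[-1]))  # full-width -> ASCII
    steps.append(unquote(steps[-1]))         # "%41" -> "A"
    steps.append(steps[-1].lower())          # "A" -> "a"
    return steps

first = trace("\uff05\uff14\uff11.com")
second = trace("%ef%bc%85%ef%bc%94%ef%bc%91.com")
```

Here `first` walks through the same four forms as the first bullet, and `second` is the percent-encoded input followed by exactly the forms in `first`.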

Both WebKit/Blink and Gecko work with the first case. WebKit/Blink work with the second case too. None of them work if a further level of craziness is added.

I can hardly think of a real use case where the first domain would appear, but it's... kind of possible. The second one must be an easter egg ;-)

This can be found in WebKit test cases: http://trac.webkit.org/browser/trunk/LayoutTests/fast/url/host.html
Comment 1 Santiago M. Mola 2014-01-09 19:08:13 UTC
Note that WebKit also accepts IPs if they use full-width characters. So maybe the spec could convert any full-width character to its ASCII counterpart (if any) before host parsing. And then forget about the second case of percent-encoded-full-width-percent-encoded characters.
Comment 2 Santiago M. Mola 2014-01-10 08:35:45 UTC
I've just checked the Chrome code... the logic was not as crazy as I thought. The algorithm followed is, roughly:

1. Percent-decode.
2. IDNA2003 ToASCII.
3. Percent-decode again.
4. Check for invalid characters.

So this "fullwidth percent-encoding" will only lead to a valid host if the encoded character was a valid ASCII character.

Note that Chrome runs all these steps on the whole host, not on domain labels. So a dot encoded with this method still works.
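A literal reading of those four steps might look like the sketch below. It is not Chrome's code and may not match Chrome's behavior for every input: NFKC normalization stands in for the IDNA2003 mapping, Punycode encoding is left out (so only hosts that end up pure ASCII are accepted), and both the function name `chrome_like_host` and the `FORBIDDEN` set are assumptions made for illustration.

```python
import unicodedata
from urllib.parse import unquote

# Hypothetical reject list for illustration; real implementations use their own.
FORBIDDEN = set(" %#/:?@[\\]")

def chrome_like_host(raw: str):
    """Apply the four steps to the whole host; return None on failure."""
    host = unquote(raw)                         # 1. percent-decode
    host = unicodedata.normalize("NFKC", host)  # 2. stand-in for the IDNA2003 mapping
    host = unquote(host)                        # 3. percent-decode again
    for ch in host:                             # 4. check for invalid characters
        # Non-ASCII is rejected here only because Punycode is omitted.
        if ch in FORBIDDEN or not ("!" <= ch <= "~"):
            return None
    return host.lower()

dotted = chrome_like_host("a\uff05\uff12\uff45com")
```

Because the steps run on the whole host rather than on each label, a full-width "%2e" ends up as a real dot: `dotted` is "a.com". A full-width "%20", on the other hand, decodes to a space and is rejected in step 4.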
Comment 3 Anne 2014-01-11 17:23:25 UTC
How can it run the IDNA algorithm on the whole host? That seems like a hack (which we could allow for of course).
Comment 4 Anne 2014-01-13 16:35:54 UTC
I cannot reproduce the first case in Gecko actually. If I have

<a href="http://&#xff05;&#xff14;&#xff11;.com">test</a>
<script>
 w(document.querySelector("a").href)
</script>

in

http://software.hixie.ch/utilities/js/live-dom-viewer/

what I get is http://%41.com/ which does not work as a URL. (It does appear to work in the UI, which seems like a UI bug.)


In Chrome it does indeed work. But note that %2541 does not work in Chrome.

In IE fullwidth does not work. %2541 does not work. %41 does work.


So it seems that blindly percent-decoding input is wrong. The double percent-decoding Chrome does seems dangerous. Firefox does no percent-decoding. It seems the easiest would be to remove it altogether. That might also be safer.
Comment 5 Boris Zbarsky 2014-01-13 16:47:48 UTC
Note https://bugzilla.mozilla.org/show_bug.cgi?id=309671
Comment 6 Anne 2014-01-13 16:55:11 UTC
So I guess we want to percent-decode. We could percent-decode all I think, even given the %2541 case.

If after percent-decoding we find " ", "%", and other code points that ought not to occur in hosts we should return failure at that point. While going through the string checking for dangerous code points we can also lowercase "A" to "a" (domain label to ASCII does not do that if the input is all ASCII).

This would mean not supporting fullwidth percent-encoding, which Chrome only seems to support in host anyway and might well be a bug.
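The proposal in this comment (decode once, fail on code points that ought not to occur in a host, lowercase ASCII along the way) could be sketched roughly as follows. The `FORBIDDEN` set and the name `decode_host_once` are assumptions for illustration only; the actual list of rejected code points is whatever the spec defines.

```python
from urllib.parse import unquote

# Hypothetical reject list; the spec's actual list of code points may differ.
FORBIDDEN = set(" %#/:?@[\\]")

def decode_host_once(raw: str):
    """Percent-decode a host exactly once; return None on failure."""
    decoded = unquote(raw)
    out = []
    for ch in decoded:
        if ch in FORBIDDEN:
            return None  # e.g. a leftover "%" or a decoded space
        # Lowercase ASCII letters during the same pass.
        out.append(ch.lower() if "A" <= ch <= "Z" else ch)
    return "".join(out)
```

With this sketch, "%41.com" becomes "a.com", while "%2541.com" decodes to "%41.com" and fails because a "%" is left over. There is no second decode, so percent sequences written in full-width characters are never expanded.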
Comment 7 Boris Zbarsky 2014-01-13 17:07:41 UTC
> We could percent-decode all I think, even given the %2541 case.

I'm not entirely convinced we want to support that.  What do other things that process URLs do?
Comment 8 Anne 2014-01-13 17:27:06 UTC
So my idea for %2541 is that it results in failure. See bug 24191 comment 4. So yeah, we should not support that in the end, but that does not mean that we cannot use a generic percent-decoding mechanism.
Comment 9 Anne 2014-01-13 18:17:48 UTC
Fullwidth and percent-decoding twice is bad. We keep the percent-decoding we had in place plus return failure for certain code points that should not occur in a domain.

https://github.com/whatwg/url/commit/81cdd6704ea695e1619e76794227d2c9d10d2aa7