This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019. Please see the home page for more details.
(Sorry for the title, I really couldn't find anything more descriptive for this case.)
- \uff05\uff14\uff11.com (%41 in Unicode full-width characters) -> %41.com -> A.com -> a.com
- %ef%bc%85%ef%bc%94%ef%bc%91.com (the former case... percent-encoded) -> \uff05\uff14\uff11.com -> %41.com -> A.com -> a.com
Both WebKit/Blink and Gecko work with the first case. WebKit/Blink work with the second case too. None of them work if a further level of craziness is added. I can hardly think of a real use case where the first domain would appear, but it's... kind of possible. The second one must be an easter egg ;-) This can be found in the WebKit test cases: http://trac.webkit.org/browser/trunk/LayoutTests/fast/url/host.html
Note that WebKit also accepts IPs if they use full-width characters. So maybe the spec could convert any full-width character to its ASCII counterpart (if any) before host parsing, and then forget about the second case of percent-encoded, full-width, percent-encoded characters.
I've just checked the Chrome code... the logic was not as crazy as I thought. The algorithm followed is, roughly:
1. Percent-decode.
2. IDNA2003 ToASCII.
3. Percent-decode again.
4. Check for invalid characters.
So this "fullwidth percent-encoding" will only lead to a valid host if the encoded character was a valid ASCII character. Note that Chrome runs all these steps on the whole host, not on domain labels. So a dot encoded with this method still works.
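For illustration, a literal reading of those four steps can be sketched in Python. This is not Chrome's code: `chrome_like_host` and `FORBIDDEN` are hypothetical names, NFKC normalization is only a stand-in for the IDNA2003 ToASCII mapping (it folds full-width forms to ASCII, which is the part that matters here), the forbidden set is an illustrative subset, and the final lowercasing is an assumption that mirrors the A.com -> a.com step in the first comment. A literal reading like this also decodes an all-ASCII host twice, so it does not necessarily match every Chrome behaviour reported in this bug.

```python
import unicodedata
from urllib.parse import unquote

# Illustrative subset of characters treated as invalid in a host (assumption).
FORBIDDEN = set(' #%/:?@[\\]')

def chrome_like_host(host):
    """Rough sketch of: decode, ToASCII stand-in, decode again, validate."""
    step1 = unquote(host)                         # 1. percent-decode (as UTF-8)
    step2 = unicodedata.normalize("NFKC", step1)  # 2. stand-in for IDNA2003 ToASCII
    step3 = unquote(step2)                        # 3. percent-decode again
    if any(ch in FORBIDDEN for ch in step3):      # 4. check for invalid characters
        return None
    return step3.lower()                          # assumed final lowercasing

print(chrome_like_host("\uff05\uff14\uff11.com"))
print(chrome_like_host("%ef%bc%85%ef%bc%94%ef%bc%91.com"))
print(chrome_like_host("a\uff052Ecom"))   # full-width percent sign followed by "2E"
```

Because every step runs on the whole host string rather than per label, the last call turns the encoded dot into a real label separator.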
How can it run the IDNA algorithm on the whole host? That seems like a hack (which we could allow for, of course).
I cannot reproduce the first case in Gecko actually. If I have

<a href="http://%41.com">test</a>
<script> w(document.querySelector("a").href) </script>

in http://software.hixie.ch/utilities/js/live-dom-viewer/ what I get is http://%41.com/, which does not work as a URL. (It does appear to work in the UI, which seems like a UI bug.)

In Chrome it does indeed work. But note that %2541 does not work in Chrome.

In IE, fullwidth does not work. %2541 does not work. %41 does work.

So it seems that blindly percent-decoding input is wrong. The double percent-decoding Chrome does seems dangerous. Firefox does no percent-decoding. It seems the easiest would be to remove it altogether. That might also be safer.
Note https://bugzilla.mozilla.org/show_bug.cgi?id=309671
So I guess we want to percent-decode. We could percent-decode everything, I think, even given the %2541 case. If after percent-decoding we find " ", "%", and other code points that ought not to occur in hosts, we should return failure at that point. While going through the string checking for dangerous code points we can also lowercase "A" to "a" (domain label to ASCII does not do that if the input is all ASCII). This would mean not supporting fullwidth percent-encoding, which Chrome only seems to support in host anyway and might well be a bug.
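A minimal sketch of that proposal, under stated assumptions: `parse_host` and `FORBIDDEN_HOST_CODE_POINTS` are hypothetical names, the forbidden set is an illustrative subset rather than the list the spec uses, and NFKC normalization again stands in for the domain-to-ASCII step. The host is percent-decoded exactly once, then scanned a single time to reject dangerous code points and lowercase ASCII letters.

```python
import unicodedata
from urllib.parse import unquote

# Illustrative subset; the real list of disallowed code points is an assumption here.
FORBIDDEN_HOST_CODE_POINTS = set(' #%/:?@[\\]')

def parse_host(host):
    """Percent-decode once, map to ASCII forms, then validate and lowercase."""
    decoded = unquote(host)                           # single percent-decode
    mapped = unicodedata.normalize("NFKC", decoded)   # stand-in for domain to ASCII
    out = []
    for ch in mapped:
        if ch in FORBIDDEN_HOST_CODE_POINTS:
            return None                               # failure: must not occur in a host
        out.append(ch.lower() if "A" <= ch <= "Z" else ch)
    return "".join(out)

print(parse_host("%41.com"))                 # decodes to "A.com", then lowercased
print(parse_host("%2541.com"))               # decodes to "%41.com", "%" is rejected
print(parse_host("\uff05\uff14\uff11.com"))  # maps to "%41.com", "%" is rejected
```

With a single decoding pass, both %2541 and the full-width form leave a literal "%" behind, so they fail instead of being silently decoded a second time.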
> We could percent-decode all I think, even given the %2541 case. I'm not entirely convinced we want to support that. What do other things that process URLs do?
So my idea for %2541 is that it results in failure. See bug 24191 comment 4. So yeah, we should not support that in the end, but that does not mean that we cannot use a generic percent-decoding mechanism.
Fullwidth and percent-decoding twice is bad. We keep the percent-decoding we had in place plus return failure for certain code points that should not occur in a domain. https://github.com/whatwg/url/commit/81cdd6704ea695e1619e76794227d2c9d10d2aa7