This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019. Please see the home page for more details.

Bug 24191 - What happens with spaces in host?
Summary: What happens with spaces in host?
Status: RESOLVED FIXED
Alias: None
Product: WHATWG
Classification: Unclassified
Component: URL
Version: unspecified
Hardware: PC Linux
Importance: P2 normal
Target Milestone: Unsorted
Assignee: Anne
QA Contact: sideshowbarker+urlspec
URL:
Whiteboard:
Keywords:
Depends on:
Blocks: 24187
Reported: 2014-01-02 14:53 UTC by Santiago M. Mola
Modified: 2014-01-14 10:58 UTC

See Also:


Attachments
TEST-CHROME-31 (5.62 KB, application/octet-stream)
2014-01-13 23:28 UTC, Santiago M. Mola
TEST-FIREFOX-25 (5.95 KB, application/octet-stream)
2014-01-13 23:28 UTC, Santiago M. Mola
TEST-IE-11 (7.02 KB, application/octet-stream)
2014-01-13 23:28 UTC, Santiago M. Mola
TEST-SAFARI-7 (5.28 KB, application/octet-stream)
2014-01-13 23:28 UTC, Santiago M. Mola

Description Santiago M. Mola 2014-01-02 14:53:44 UTC
If I interpreted correctly, with the current spec, "http://ho st" will be parsed with no errors at all, resulting in host set to "ho st". This doesn't match WebKit or Gecko behaviour.

WebKit:
location.host = 'ho st'
will lead to an HTTP request to "http://ho%20st"

Gecko:
location.host = 'ho st'
will throw NS_ERROR_MALFORMED_URI
Comment 1 Anne 2014-01-06 18:15:16 UTC
Are you sure? I thought the ToASCII algorithm would fail. If it doesn't this would be a bug somewhere.
Comment 2 Santiago M. Mola 2014-01-06 21:47:54 UTC
Right. ToASCII algorithm will fail if the UseSTD3ASCIIRules flag is set. It won't otherwise.
Comment 3 Anne 2014-01-08 17:28:53 UTC
Okay, so browsers do not implement UseSTD3ASCIIRules. They allow some, such as _, but forbid others, such as space.

Le sigh.

We cannot set that flag because of the underscore. So we need a list of code points, including space, that will make the parser return failure. That seems the easiest. Maybe that could be done in the same step that lowercases certain code points.
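The difference between the two approaches can be sketched roughly as follows. This is only an illustration: `passesStd3` and `passesForbiddenList` are hypothetical names, the STD3 check is a simplified letter-digit-hyphen test for a label that is already ASCII, and the forbidden list is assumed to contain only space, the one code point named so far.

```javascript
// Hypothetical helpers; neither name comes from the URL spec.

// Simplified STD3 ASCII rules (RFC 3490): an ASCII label may contain only
// letters, digits, and hyphen, and may not start or end with a hyphen.
function passesStd3(label) {
  return /^[A-Za-z0-9-]+$/.test(label) &&
         !label.startsWith("-") && !label.endsWith("-");
}

// Alternative: a short list of code points that make host parsing fail.
// Assumed here to hold only space (0x20).
const FORBIDDEN = new Set([0x20]);

function passesForbiddenList(label) {
  for (const ch of label) {
    if (FORBIDDEN.has(ch.codePointAt(0))) return false;
  }
  return true;
}

for (const label of ["host", "my_host", "ho st"]) {
  console.log(JSON.stringify(label), passesStd3(label), passesForbiddenList(label));
}
// "host" true true
// "my_host" false true
// "ho st" false false
```

With the STD3 flag set, "my_host" is rejected along with "ho st"; a forbidden list rejects the space while still letting the underscore through.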
Comment 4 Anne 2014-01-13 17:24:09 UTC
Firefox only forbids 0x00 and 0x20 at the moment.

However, if we percent-decode we should also forbid "%", "/", "\", "?", "#", and ":" as otherwise you can get re-parsing attacks.
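A minimal sketch of why the extra code points matter once the host is percent-decoded. `decodeAndCheckHost` is a hypothetical helper, `decodeURIComponent` stands in for the spec's percent-decode step (it additionally requires valid UTF-8), and IPv6 literals, which legitimately contain ":", are ignored here.

```javascript
// Code points named in this thread: 0x00 and 0x20 (what Firefox forbids),
// plus the delimiters that could change how the URL is re-parsed.
const FORBIDDEN_AFTER_DECODE = new Set(
  ["\u0000", " ", "%", "/", "\\", "?", "#", ":"]
);

// Hypothetical helper: percent-decode a host string, then return null
// (failure) if the decoded result contains a forbidden code point.
function decodeAndCheckHost(rawHost) {
  let decoded;
  try {
    decoded = decodeURIComponent(rawHost);
  } catch (e) {
    return null; // malformed percent-encoding
  }
  for (const ch of decoded) {
    if (FORBIDDEN_AFTER_DECODE.has(ch)) return null;
  }
  return decoded;
}

console.log(decodeAndCheckHost("example.com"));        // example.com
console.log(decodeAndCheckHost("example.com%2Fpath")); // null ("/" after decoding)
console.log(decodeAndCheckHost("a%3A80"));             // null (":" after decoding)
console.log(decodeAndCheckHost("a%2561"));             // null ("%" survives, could be decoded again)
```

Without the check, "example.com%2Fpath" would decode to a host that contains "/", so serializing and parsing the URL again would yield a different host and path.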
Comment 5 Anne 2014-01-13 17:25:36 UTC
I used this to find failure code points:

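<script>
 // w() is assumed to be the logging helper provided by the Live DOM Viewer.
 function testURL(url, cp) {
   var a = document.createElement("a");
   a.href = url;
   var output = cp + ": ";
   if (a.host)
     output += "parsed; ";
   output += a.host;
   w(output);
 }
 // Try each code point from 0x00 to 0xFE between two "a"s in the host.
 for (var i = 0; i < 0xFF; i++) {
   var url = "http://a" + String.fromCodePoint(i) + "a/";
   testURL(url, i);
 }
</script>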

in

http://software.hixie.ch/utilities/js/live-dom-viewer/
Comment 7 Santiago M. Mola 2014-01-13 23:26:39 UTC
I'm attaching results for this test (modified to work in other browsers):

<script>
 function testURL(url, cp) {
   var output = "0x" + cp.toString(16) + ": ";
   try {
     var a = document.createElement("a");
     a.href = url;
     if(a.host)
       output += "parsed; ";
     output += a.host;
   } catch (e) {
     output += e;
   }
   w(output);
 }
 for(var i = 0; i < 0xFF; i++) {
   var url = "http://a" + String.fromCharCode(i) + "a/";
   testURL(url, i);
 }
</script>

Firefox is the most permissive here. On Firefox 25, parsing fails (as in !a.host) for: 0x00, 0x20, 0x3A (':'), 

Chrome 31 only works with 0x20-0x24, 0x26-0x2E, 0x30-0x39, 0x3C-0x3E, 0x40-0x5A, 0x5F-0x7D.
Note that:
  - 0x09, 0x0A, 0x0D are ignored before parsing and shouldn't be expected after ToASCII.
  - 0x2F ('/') works with this test, but will fail if used to set a.host.
  - 0x3A (':') works both with this test and setting a.host, but it leads to an unexpected result (':0' and 'a:0' respectively) instead of 'a%3Aa' or 'a:a'.

Safari 7 only works with 0x25, 0x2D, 0x2E, 0x30-0x39, 0x41-0x5A, 0x5F, 0x61-0x7A.
  - 0x00, 0x23 ('#'), 0x2F ('/'), 0x3F ('?'), 0x40 ('@'), 0x5C ('\') are accepted... but truncate host.

IE 11 and 10 only work with 0x01-0x24, 0x26-0x2E, 0x30-0x39, 0x3B-0x3E, 0x41-0x5B, 0x5D-0x7F.
  - 0x00 is accepted but truncates host.
  - 0x09, 0x0A, 0x0D are ignored before parsing and shouldn't be expected after ToASCII.
  - 0x25, 0x3A will throw "Invalid argument".
  - I could not test 0x2F, 0x3F, 0x40, 0x5C properly.

For all tests, I've ignored output above 0x7F. Those code points shouldn't be present after ToASCII (and it should be a failure if they are).
Comment 8 Santiago M. Mola 2014-01-13 23:28:04 UTC
Created attachment 1426 [details]
TEST-CHROME-31
Comment 9 Santiago M. Mola 2014-01-13 23:28:22 UTC
Created attachment 1427 [details]
TEST-FIREFOX-25
Comment 10 Santiago M. Mola 2014-01-13 23:28:41 UTC
Created attachment 1428 [details]
TEST-IE-11
Comment 11 Santiago M. Mola 2014-01-13 23:28:59 UTC
Created attachment 1429 [details]
TEST-SAFARI-7
Comment 12 Santiago M. Mola 2014-01-13 23:49:10 UTC
Honestly, I can't see any benefit in accepting anything not in the 0-9a-z_ range after ToASCII is run. This is the only range that works consistently across browsers and it's somewhat close to previous standards (including IDNA), with "_" being the only deviation, which is justified by widespread use in the real world.
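The claim about a consistent range can be checked with a rough sketch that intersects the "only works with" ranges quoted in comment 7. The ranges are copied as written there and are not re-verified against the attachments; Firefox is modelled only by the three failing code points listed, and the names `works`, `firefoxFails`, and `inRanges` are hypothetical.

```javascript
// Inclusive ranges copied from comment 7 ("only works with ...").
const works = {
  chrome: [[0x20, 0x24], [0x26, 0x2E], [0x30, 0x39], [0x3C, 0x3E], [0x40, 0x5A], [0x5F, 0x7D]],
  safari: [[0x25, 0x25], [0x2D, 0x2E], [0x30, 0x39], [0x41, 0x5A], [0x5F, 0x5F], [0x61, 0x7A]],
  ie:     [[0x01, 0x24], [0x26, 0x2E], [0x30, 0x39], [0x3B, 0x3E], [0x41, 0x5B], [0x5D, 0x7F]],
};
// Firefox 25 is described by what fails rather than by what works.
const firefoxFails = new Set([0x00, 0x20, 0x3A]);

function inRanges(cp, ranges) {
  return ranges.some(([lo, hi]) => cp >= lo && cp <= hi);
}

// Collect every ASCII code point that works in all four browsers.
const common = [];
for (let cp = 0; cp <= 0x7F; cp++) {
  const everywhere = Object.values(works).every((r) => inRanges(cp, r));
  if (everywhere && !firefoxFails.has(cp)) common.push(cp);
}
const commonChars = common.map((cp) => String.fromCharCode(cp)).join("");
console.log(commonChars);
// -.0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyz
```

Under these assumptions the common set is ASCII letters, digits, "-", ".", and "_"; the uppercase letters would presumably be lowercased by host parsing and "." separates labels, so the result is close to the range proposed above, with "-" added.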
Comment 13 Anne 2014-01-14 10:50:41 UTC
The fix here was wrong. It needs to happen after ToASCII has run.

The idea behind letting through most ASCII code points is to allow for weirdly configured intranet environments. A network error seems preferable to not being able to type in an address.