This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019. Please see the home page for more details.
Right now, IPv4 addresses (not IPv6 with an IPv4-like suffix) are considered to be just like DNS domain names. It turns out there are many "interesting" variations of syntax for IPv4 (hexadecimal or octal numbers, combining the last 2 to 4 parts into one 16- to 32-bit number, ...): https://github.com/w3c/web-platform-tests/issues/1104 The spec should define what to do: which syntax is accepted and which makes the URL invalid, what normalization happens (if any), and what is considered an IPv4 address as opposed to a (DNS) domain name. Right now a "host" in the spec is either a "domain" (a string) or an "IPv6 address" (eight 16 bit integers). I suggest adding a variant so it can also be an "IPv4 address" (four 8 bit integers).
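To illustrate the suggested three-way "host" variant, here is a minimal Python sketch. The class names (Domain, Ipv4Address, Ipv6Address) and the Host alias are hypothetical and chosen only for this example; they are not taken from the spec.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class Domain:
    # A host that is left as a string for DNS to resolve.
    name: str


@dataclass(frozen=True)
class Ipv4Address:
    # Four 8 bit integers, as suggested in the comment above.
    octets: Tuple[int, int, int, int]

    def __post_init__(self):
        if len(self.octets) != 4 or any(not 0 <= o <= 0xFF for o in self.octets):
            raise ValueError("an IPv4 address is four integers in 0..255")


@dataclass(frozen=True)
class Ipv6Address:
    # Eight 16 bit integers.
    pieces: Tuple[int, ...]

    def __post_init__(self):
        if len(self.pieces) != 8 or any(not 0 <= p <= 0xFFFF for p in self.pieces):
            raise ValueError("an IPv6 address is eight integers in 0..65535")


# Hypothetical alias: a parsed host is exactly one of the three variants.
Host = Union[Domain, Ipv4Address, Ipv6Address]
```

With a representation like this, the string "192.168.0.1" kept as a Domain and the parsed Ipv4Address((192, 168, 0, 1)) are distinct values, which is the distinction the spec would have to define.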
Beyond the usual "undefined things hurt interop", this is probably a security issue where non-equivalent URLs look like each other (phishing), or equivalent URLs look different (bypassing URL-as-string-based access control). I don’t know what the right thing to do is, though.
Also: does Nameprep / IDNA normalization (such as full-width CJK digits to ASCII digits) apply to IPv4 addresses? If so, should it apply to IPv6 too? https://github.com/w3c/web-platform-tests/blob/2712b5611a4e048e04a7dc814a7a31413d2d367a/url/urltestdata.txt#L315-L317
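For reference, a small Python sketch of what the compatibility-normalization step would do to full-width digits. Nameprep specifies NFKC as its normalization form; this shows only that one step, not full Nameprep or IDNA processing, and the helper name fold_compatibility is made up for the example.

```python
import unicodedata


def fold_compatibility(host: str) -> str:
    # NFKC maps full-width digits and the full-width full stop
    # to their ASCII counterparts.
    return unicodedata.normalize("NFKC", host)


# "192.168.0.1" written with full-width digits and full-width full stops.
fullwidth_host = "\uff11\uff19\uff12\uff0e\uff11\uff16\uff18\uff0e\uff10\uff0e\uff11"
print(fold_compatibility(fullwidth_host))
```

If this step runs before the host is classified, a full-width string becomes indistinguishable from a plain ASCII IPv4 address, which is why the ordering matters for the question above.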
So the question here is whether parsing a URL should return an object that's an IPv4 address, or store IPv4 in the same way as a domain and let the network layer deal with it. From what I'm told, the network layer currently deals with it, even though some older RFCs suggest it should also be distinguished at the URL layer.
As far as I understand, in C/C++ on a POSIX system, applications that want to open a TCP connection typically:
* Call gethostbyname() or getaddrinfo() (provided by libc) with the "host" as a string, which returns a 32 bit or 128 bit address. These functions simply do string parsing if the input looks like an IPv4 or IPv6 address, otherwise they do a DNS lookup.
* Create a socket.
* Call connect() on the socket with a 32 bit or 128 bit address.
The problem is that, on some platforms, gethostbyname/getaddrinfo supports exotic IPv4 syntax that we might instead want browsers to reject, normalize, or give to DNS to resolve. Implementations can perfectly well skip gethostbyname/getaddrinfo and call connect() with a 32 bit address from the URL parser, as the spec already requires them to do for 128 bit IPv6 addresses. I don’t see a reason that IPv4 and IPv6 addresses should be parsed at different layers.
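The three steps above can be sketched with Python's socket module, which wraps the same libc calls. This is only an illustration: the function names are hypothetical, and Python's connect() takes the address in string form rather than as a raw 32 bit integer.

```python
import socket


def resolve_ipv4(host: str, port: int):
    # Step 1: getaddrinfo() parses the string itself if it looks like an
    # address, and otherwise performs a DNS lookup.
    infos = socket.getaddrinfo(
        host, port, family=socket.AF_INET, type=socket.SOCK_STREAM
    )
    return [sockaddr for _family, _type, _proto, _canonname, sockaddr in infos]


def connect_ipv4(host: str, port: int) -> socket.socket:
    address = resolve_ipv4(host, port)[0]
    # Step 2: create a socket.
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Step 3: connect() using the already-resolved address.
    sock.connect(address)
    return sock
```

Whether resolve_ipv4("0x7f.0.0.1", 80) succeeds depends on the platform's getaddrinfo(), which is exactly the interop concern raised here.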
What method is invoked for DNS, then? I agree that we do not want "exotic IPv4", but Safari does not have that and the specification follows their advice with regards to this issue.
gethostbyname/getaddrinfo is used for DNS. But now I realize that we can not force it to skip IP address parsing, so "exotic IPv4" would have to be either rejected or normalized in the URL parser. Safari does not have that because their version of gethostbyname/getaddrinfo (or whatever the equivalent is) behaves differently from other platforms. Exposing that difference to web content hurts interop.
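As one illustration of the "rejected" option, the sketch below accepts only strict four-part decimal IPv4 and refuses anything else whose last label looks numeric, so it never reaches gethostbyname/getaddrinfo. The rule for "looks numeric" and the name classify_host are assumptions made for this example, not something the thread or the spec prescribes.

```python
import ipaddress
import re

# Assumption for this sketch: a label made only of decimal digits, or one
# starting with "0x", is treated as an attempted IPv4 number.
_NUMERIC_LABEL = re.compile(r"^(0[xX][0-9a-fA-F]*|[0-9]+)$")


def classify_host(host: str) -> str:
    try:
        # The standard library parser only accepts four decimal parts.
        ipaddress.IPv4Address(host)
        return "ipv4"
    except ipaddress.AddressValueError:
        pass
    last_label = host.rstrip(".").rsplit(".", 1)[-1]
    if _NUMERIC_LABEL.match(last_label):
        raise ValueError("exotic IPv4 syntax rejected: %r" % host)
    return "domain"
```

Under this rule "192.168.0.1" is an IPv4 address, "example.org" is a domain, and "0xc0.168.0.1" or "0x1232131" make the URL invalid instead of being handed to the resolver.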
Is this advice from Safari written down somewhere online?
It's not clear how we can normalize or skip it. I would like it to do a normal DNS lookup.
Okay, so this is a description of that method: http://pubs.opengroup.org/onlinepubs/009695399/functions/getaddrinfo.html There has to be a way to do a clean DNS lookup, though, without strings with numeric input ending up as IP addresses.
Sure. You have to manually reimplement a DNS resolver. This is, unfortunately, a Major Production if you want the result to not be broken from the user's perspective (i.e. respecting /etc/hosts and the Windows equivalents and so forth).
So the problem is that user agents that use that lower-level function end up treating http://0xc0.168.0.1/ as http://192.168.0.1/ for instance.
Sorry, that comment was meant to be longer. It seems Safari and Firefox do not normalize, but do not fail on it either. And Opera/Chrome and Internet Explorer have some kind of aggressive normalization for these kinds of IP addresses (there's also http://0x1232131/ which is http://1.35.33.49/ and so on). I was kind of hoping to forbid these addresses, as sneaking in IP addresses like that seems bad for security, but perhaps the strategy ought to be early normalization instead, so that as long as you use a browser-provided URL parser, you can safely compare input/output of URLs.
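A minimal Python sketch of what such early normalization could look like, covering the hexadecimal, octal, and combined-parts forms mentioned in this bug. It is not any browser's actual implementation; the function names and the choice to treat a bare "0x" as zero are assumptions for illustration.

```python
def parse_ipv4_number(part: str) -> int:
    """Parse one dot-separated part as decimal, octal (leading 0), or hex (0x)."""
    if part == "":
        raise ValueError("empty part")
    lowered = part.lower()
    if lowered.startswith("0x"):
        digits, base = lowered[2:], 16
    elif len(lowered) > 1 and lowered.startswith("0"):
        digits, base = lowered[1:], 8
    else:
        digits, base = lowered, 10
    if digits == "":
        return 0  # assumption: a bare "0x" counts as zero
    allowed = "0123456789abcdef"[:base]
    if any(ch not in allowed for ch in digits):
        raise ValueError("not a number in base %d: %r" % (base, part))
    return int(digits, base)


def normalize_ipv4(host: str) -> str:
    """Return the dotted-decimal form of an IPv4 host, or raise ValueError."""
    parts = host.split(".")
    if not 1 <= len(parts) <= 4:
        raise ValueError("expected 1 to 4 parts")
    numbers = [parse_ipv4_number(p) for p in parts]
    *leading, last = numbers
    # Every part except the last is a single byte.
    if any(n > 0xFF for n in leading):
        raise ValueError("part out of range")
    # The last part fills all the bytes that are left (1 to 4 of them).
    if last >= 256 ** (4 - len(leading)):
        raise ValueError("last part out of range")
    value = last
    for index, n in enumerate(leading):
        value += n << (8 * (3 - index))
    return ".".join(str((value >> shift) & 0xFF) for shift in (24, 16, 8, 0))


print(normalize_ipv4("0xc0.168.0.1"))  # 192.168.0.1
print(normalize_ipv4("0x1232131"))     # 1.35.33.49
```

Because the output is always four decimal parts, two URLs that reach the same address serialize to the same host string and can be compared as strings.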
> Safari does not have that
What are the observable differences in behavior between Safari and other browsers? Loading http://0x7f.0.0.1:8000 results in a request to 127.0.0.1:8000 in Safari 7.1. And just like in Firefox, the Host header field value was "0x7f.0.0.1:8000".
ap, I was wrong about Safari, or perhaps it changed over time. Safari now does appear to do the same as Firefox, as I said in comment 12. The difference with e.g. Chrome is that in Chrome http://0x7f.0.0.1:8000 becomes http://127.0.0.1:8000/ due to the URL parser.
Proposal: http://intertwingly.net/projects/pegurl/url.html#ipv4addr Proposed updates to the urltestdata.txt: https://github.com/rubys/url/blob/1a03e53c5c83791769d9348292c11c61b034f25d/reference-implementation/test/patchtestdata.txt Experiment with it here: http://intertwingly.net/projects/pegurl/liveview.html
It seems a test for this landed a while ago: https://github.com/w3c/web-platform-tests/pull/1402 I'll look into adding some more around e.g. normalizing half-width and full-width code points. Since the railroad diagram approach didn't reach consensus I added this as a normal parser to the specification, taking some cues from Sam's work: https://github.com/whatwg/url/commit/904374077513ac73d4e8ed2a8a76a460bb369735
+1