26431 – Define IPv4 parsing

This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019.

Status:	RESOLVED FIXED

Alias:	None

Product:	WHATWG
Classification:	Unclassified
Component:	URL
Version:	unspecified
Hardware:	All All

Importance:	P2 normal
Target Milestone:	Unsorted
Assignee:	Anne
QA Contact:	sideshowbarker+urlspec

Depends on:	25946

Reported:	2014-07-24 23:02 UTC by Simon Sapin
Modified:	2015-07-01 14:32 UTC
CC List:	7 users

Description Simon Sapin 2014-07-24 23:02:01 UTC

Right now, IPv4 addresses (not IPv6 with an IPv4-like suffix) are considered to be just like DNS domain names.

It turns out there are many "interesting" variations of IPv4 syntax (hexadecimal or octal numbers, combining the last 2~4 parts into one 16~32 bit number, ...)

https://github.com/w3c/web-platform-tests/issues/1104

The spec should define what to do: what syntax is accepted or make the URL invalid, what normalization happens if any, what is considered an IPv4 address or a (DNS) domain name.

Right now a "host" in the spec is either a "domain" (a string) or an "IPv6 address" (eight 16 bit integers). I suggest adding a variant so it can also be an "IPv4 address" (four 8 bit integers).
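The suggested three-way host type could be sketched as follows. This is only an illustration of the idea in the description; the class names (Domain, IPv4Address, IPv6Address) and the Host alias are hypothetical and are not taken from the specification.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class Domain:
    # Hypothetical name: a domain is kept as a plain string.
    name: str


@dataclass(frozen=True)
class IPv4Address:
    # Hypothetical name: four 8-bit integers, as suggested above.
    octets: Tuple[int, int, int, int]

    def __post_init__(self) -> None:
        if len(self.octets) != 4 or not all(0 <= o <= 0xFF for o in self.octets):
            raise ValueError("expected four integers in the range 0..255")


@dataclass(frozen=True)
class IPv6Address:
    # Hypothetical name: eight 16-bit integers.
    pieces: Tuple[int, ...]

    def __post_init__(self) -> None:
        if len(self.pieces) != 8 or not all(0 <= p <= 0xFFFF for p in self.pieces):
            raise ValueError("expected eight integers in the range 0..65535")


# A host is exactly one of the three variants.
Host = Union[Domain, IPv4Address, IPv6Address]
```

With such a type, the string "127.0.0.1" stored as a Domain and the address 127.0.0.1 stored as an IPv4Address are distinct values, which is the distinction the description asks the spec to make explicit.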

Comment 1 Simon Sapin 2014-07-24 23:06:31 UTC

Beyond the usual "undefined things hurt interop", this is probably a security issue: non-equivalent URLs can look like each other (phishing), and equivalent URLs can look different (bypassing access control that compares URLs as strings). I don’t know what the right thing to do is, though.

Comment 2 Simon Sapin 2014-07-25 07:54:04 UTC

Also: does Nameprep / IDNA normalization (such as full-width CJK digits to ASCII digits) apply to IPv4 addresses? If so, should it apply to IPv6 too?

https://github.com/w3c/web-platform-tests/blob/2712b5611a4e048e04a7dc814a7a31413d2d367a/url/urltestdata.txt#L315-L317
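To make the question concrete: Nameprep includes Unicode NFKC normalization, and NFKC alone already turns full-width digits and the full-width full stop into their ASCII forms. The sketch below shows only that one step, using Python's standard unicodedata module; it is not a full Nameprep or IDNA implementation, and the function name fold_compat is hypothetical.

```python
import unicodedata


def fold_compat(host: str) -> str:
    # Hypothetical helper: applies NFKC only, which is one step of Nameprep,
    # not the whole IDNA processing.
    return unicodedata.normalize("NFKC", host)


# "192.168.0.1" written with full-width digits (U+FF10..U+FF19)
# and full-width full stops (U+FF0E).
fullwidth_host = "\uff11\uff19\uff12\uff0e\uff11\uff16\uff18\uff0e\uff10\uff0e\uff11"
ascii_host = fold_compat(fullwidth_host)
```

Here ascii_host is the plain ASCII string 192.168.0.1, so whether this mapping runs before IPv4 detection decides whether such a host is treated as an address at all.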

Comment 3 Anne 2014-07-28 13:31:15 UTC

So the question here is whether parsing a URL should return an object that's an IPv4 address or have IPv4 stored in the same way as a domain and let the network deal with it.

From what I'm told, the network layer currently deals with it, even though some older RFCs suggest it should also be distinguished at the URL layer.

Comment 4 Simon Sapin 2014-07-28 16:20:44 UTC

As far as I understand, in C/C++ on a POSIX system, applications that want to open a TCP connection typically:

* Call gethostbyname() or getaddrinfo() (provided by libc) with the "host" as a string; these return a 32 bit or 128 bit address. They simply parse the string if the input looks like an IPv4 or IPv6 address, and otherwise do a DNS lookup.
* Create a socket
* Call connect() on the socket with a 32 bit or 128 bit address.

The problem is that, on some platforms, gethostbyname/getaddrinfo supports exotic IPv4 syntax that we might want browsers to reject, normalize, or give to DNS to resolve, instead.

Implementations can perfectly well skip gethostbyname/getaddrinfo and call connect() with a 32 bit address from the URL parser, as the spec already requires them to do for 128 bit IPv6 addresses.

I don’t see a reason that IPv4 and IPv6 addresses should be parsed at different layers.
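A rough sketch of that idea, in Python rather than C: once the URL parser has produced a 32 bit integer, the application can build the socket address itself, so only the canonical form ever reaches the platform. The function name sockaddr_from_ipv4 is hypothetical. Python's socket API takes dotted-decimal text rather than a raw integer, so the sketch converts the integer back to its canonical spelling; in C the same value would be stored in network byte order in a sockaddr_in.

```python
import socket
import struct


def sockaddr_from_ipv4(address: int, port: int):
    # Hypothetical helper: "address" is assumed to be the 32 bit integer
    # already produced by the URL parser, so no exotic syntax is left.
    if not 0 <= address <= 0xFFFFFFFF:
        raise ValueError("not a 32 bit address")
    packed = struct.pack("!I", address)       # 4 bytes, network byte order
    return (socket.inet_ntoa(packed), port)   # canonical dotted-decimal text


# The result can be passed to connect() on an AF_INET socket, e.g.:
#   sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
#   sock.connect(sockaddr_from_ipv4(0x7F000001, 8000))
```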

Comment 5 Anne 2014-07-28 16:30:33 UTC

What method is invoked for DNS, then? I agree that we do not want "exotic IPv4", but Safari does not have that and the specification follows their advice with regards to this issue.

Comment 6 Simon Sapin 2014-07-28 16:43:12 UTC

gethostbyname/getaddrinfo is used for DNS. But now I realize that we cannot force it to skip IP address parsing, so "exotic IPv4" would have to be either rejected or normalized in the URL parser.

Safari does not have that because their version of gethostbyname/getaddrinfo (or whatever the equivalent is) behaves differently from other platforms. Exposing that difference to web content hurts interop.

Comment 7 Simon Sapin 2014-07-28 16:44:20 UTC

Is this advice from Safari written down somewhere online?

Comment 8 Anne 2014-07-28 16:59:08 UTC

It's not clear how we can normalize or skip it. I would like it to do a normal DNS lookup.

Comment 9 Anne 2014-10-14 16:58:13 UTC

Okay, so this is a description of that method:
http://pubs.opengroup.org/onlinepubs/009695399/functions/getaddrinfo.html

There has to be a way to do a clean DNS lookup, though, without strings with numeric input ending up as IP addresses.

Comment 10 Boris Zbarsky 2014-10-14 19:44:42 UTC

Sure.  You have to manually reimplement a DNS resolver.  This is, unfortunately, a Major Production if you want the result to not be broken from the user's perspective (i.e. respecting /etc/hosts and the Windows equivalents and so forth).

Comment 11 Anne 2014-10-15 07:00:24 UTC

So the problem is that user agents that use that lower-level function end up treating

  http://0xc0.168.0.1/

as

  http://192.168.0.1/

for instance.

Comment 12 Anne 2014-10-15 07:05:47 UTC

Sorry, that comment was meant to be longer.

It seems Safari and Firefox do not normalize, but do not fail on it either. And Opera/Chrome and Internet Explorer have some kind of aggressive normalization for these kinds of IP addresses (there's also http://0x1232131/ which is http://1.35.33.49/ and so on).

I was kind of hoping to forbid these addresses as sneaking in IP addresses like that seems bad for security, but perhaps instead the strategy ought to be early normalization. So that as long as you use a browser-provided URL parser, you can safely compare input/output of URLs.
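As an illustration of what early normalization could look like, here is a minimal sketch of a lenient IPv4 parser that accepts the variations mentioned in this bug (hexadecimal and octal parts, and a final part that fills the remaining bytes) and serializes the result as plain dotted decimal. The function names parse_number, parse_ipv4 and serialize_ipv4 are hypothetical, and the exact rules (for instance treating a bare "0x" as zero, or tolerating one trailing dot) are assumptions, not text from any specification.

```python
from typing import Optional


def parse_number(part: str) -> Optional[int]:
    """Parse one part as hexadecimal (0x prefix), octal (leading 0) or decimal."""
    radix = 10
    if part[:2].lower() == "0x":
        part, radix = part[2:], 16
    elif len(part) > 1 and part[0] == "0":
        part, radix = part[1:], 8
    if part == "":
        return 0  # assumption: a bare "0x" counts as zero
    digits = "0123456789abcdef"[:radix]
    if any(c not in digits for c in part.lower()):
        return None  # not a number in this radix
    return int(part, radix)


def parse_ipv4(host: str) -> Optional[int]:
    """Return a 32 bit address, or None if the host is not IPv4-like."""
    parts = host.split(".")
    if len(parts) > 1 and parts[-1] == "":
        parts.pop()  # assumption: tolerate one trailing dot
    if not 1 <= len(parts) <= 4:
        return None
    numbers = []
    for part in parts:
        number = parse_number(part) if part else None
        if number is None:
            return None
        numbers.append(number)
    # Every part but the last is one byte; the last fills the remaining bytes.
    if any(n > 0xFF for n in numbers[:-1]):
        return None
    if numbers[-1] >= 1 << (8 * (5 - len(numbers))):
        return None
    address = numbers[-1]
    for index, number in enumerate(numbers[:-1]):
        address += number << (8 * (3 - index))
    return address


def serialize_ipv4(address: int) -> str:
    """Serialize a 32 bit address as four decimal parts."""
    return ".".join(str((address >> shift) & 0xFF) for shift in (24, 16, 8, 0))
```

Under these assumptions, 0xc0.168.0.1 and 192.168.0.1 parse to the same integer, so two URLs that differ only in the spelling of the address compare equal once both have gone through the parser, and a host such as example.com is left alone because parse_ipv4 returns None for it.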

Comment 13 Alexey Proskuryakov 2014-10-15 17:12:30 UTC

> Safari does not have that

What are the observable differences in behavior between Safari and other browsers? Loading http://0x7f.0.0.1:8000 results in a request to 127.0.0.1:8000 in Safari 7.1.

And just like in Firefox, the Host header field value was "0x7f.0.0.1:8000".

Comment 14 Anne 2014-10-15 17:22:16 UTC

ap, I was wrong about Safari or perhaps it changed over time. Safari now appears to do the same as Firefox indeed as I said in comment 12. The difference with e.g. Chrome is that in Chrome http://0x7f.0.0.1:8000 becomes http://127.0.0.1:8000/ due to the URL parser.

Comment 15 Sam Ruby 2014-11-05 12:28:32 UTC

Proposal: http://intertwingly.net/projects/pegurl/url.html#ipv4addr

Proposed updates to the urltestdata.txt: https://github.com/rubys/url/blob/1a03e53c5c83791769d9348292c11c61b034f25d/reference-implementation/test/patchtestdata.txt

Experiment with it here: http://intertwingly.net/projects/pegurl/liveview.html

Comment 16 Anne 2015-07-01 13:54:40 UTC

It seems a test for this landed a while ago (https://github.com/w3c/web-platform-tests/pull/1402). I'll look into adding some more around e.g. normalizing half-width and full-width code points.

Since the railroad diagram approach didn't reach consensus I added this as a normal parser to the specification, taking some cues from Sam's work:

https://github.com/whatwg/url/commit/904374077513ac73d4e8ed2a8a76a460bb369735

Comment 17 Sam Ruby 2015-07-01 14:32:29 UTC

+1