Bug 15254 - Don't forbid underscore in host names in URLs
Status: RESOLVED DUPLICATE of bug 18910
Product: WHATWG
Classification: Unclassified
Component: URL
Version: unspecified
Hardware: PC Windows 3.1
Importance: P2 normal
Target Milestone: Unsorted
Assigned To: Anne
QA Contact: sideshowbarker+urlspec
Reported: 2011-12-17 08:17 UTC by Brian Campbell
Modified: 2012-12-21 14:30 UTC (History)
Description Brian Campbell 2011-12-17 08:17:20 UTC
Step 6 of section 2.6.3, resolving URLs <http://www.w3.org/TR/html5/urls.html#resolving-urls>, requires that the ToASCII algorithm of IDNA 2003 (RFC 3490, http://tools.ietf.org/html/rfc3490) be called with the UseSTD3ASCIIRules flag set. The UseSTD3ASCIIRules flag says that the host name rules specified in STD 3 (RFC 1122 and RFC 1123) should be enforced. This means that host name labels are restricted to letters, digits, and hyphens, and may not begin or end with a hyphen.

Host names in the wild can contain underscores, and most software seems to cope just fine with them. I discovered this problem when someone had problems submitting such a URL to Reddit <http://www.reddit.com/r/boston/comments/neb4h/boston_hockey_player_didnt_get_kicked_from_the/c38emxu>, which enforces the host name restriction. However, none of the browsers I tried (Firefox, Chrome, Safari, and Opera, all on Mac OS X 10.7.2) implemented this restriction; that host name works fine in all of them. I've checked the Alexa Top Million Sites <http://s3.amazonaws.com/alexa-static/top-1m.csv.zip>, and found over a dozen hosts that contain underscores in their names.
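A check like the one described above could be sketched roughly as follows. This is only an illustration, not the script that was actually used: the sample rows are made up (hypothetical, not real Alexa data), and the helper name hosts_with_underscore is an assumption. The only thing taken from the real file is its "rank,domain" layout.

```python
import csv
import io

# Hypothetical sample in the "rank,domain" layout of top-1m.csv.
# These rows are invented for illustration; they are not real Alexa data.
SAMPLE_CSV = """\
1,example.com
2,foo_bar.example.net
3,plain-host.example.org
4,under_score.example
"""


def hosts_with_underscore(csv_file):
    """Return every host in a rank,domain CSV whose name contains an underscore."""
    return [
        row[1]
        for row in csv.reader(csv_file)
        if len(row) >= 2 and "_" in row[1]
    ]


print(hosts_with_underscore(io.StringIO(SAMPLE_CSV)))
```

To run it against the real list, the same function could be given an open file handle for the unzipped CSV instead of the in-memory sample.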

I would recommend relaxing the UseSTD3ASCIIRules restriction, by a willful violation of RFC 3490 (or its successor, RFC 5891 <http://tools.ietf.org/html/rfc5891>, if that is ever used), to allow the underscore in the same places that a hyphen is allowed.
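To make the difference concrete, here is a minimal sketch of the strict label check next to the relaxed one proposed above. It is not the ToASCII algorithm itself: it assumes the host is already plain ASCII, skips all IDNA processing, and does not handle a trailing root dot. The names STRICT_LABEL, RELAXED_LABEL, and is_valid_host are made up for this example. The relaxed pattern reads "the same places that a hyphen is allowed" literally, so an underscore may not begin or end a label either.

```python
import re

# Strict rule (what UseSTD3ASCIIRules enforces on each label, plus the
# 1-63 character length limit): letters, digits, and hyphens only, with
# no leading or trailing hyphen.
STRICT_LABEL = re.compile(r"(?!-)[A-Za-z0-9-]{1,63}(?<!-)")

# Relaxed rule (the proposal, read literally): an underscore is accepted
# wherever a hyphen is, so it may not begin or end a label either.
RELAXED_LABEL = re.compile(r"(?![-_])[A-Za-z0-9_-]{1,63}(?<![-_])")


def is_valid_host(host, relaxed=False):
    """Check every dot-separated label of an ASCII host name."""
    pattern = RELAXED_LABEL if relaxed else STRICT_LABEL
    return all(pattern.fullmatch(label) for label in host.split("."))


host = "neshl_mboston.stats.pointstreak.com"
print(is_valid_host(host))                # strict check rejects the underscore
print(is_valid_host(host, relaxed=True))  # relaxed check accepts it
```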
Comment 1 Marat Tanalin | tanalin.com 2011-12-17 13:52:04 UTC
Maybe the underscore character should at least be allowed in _sub_domain names (foo_bar.example.com), since such subdomains do indeed work in the real world.

Domain registrars usually do not allow underscores in second-level domains (foo_bar.com), but _sub_domains are _not_ subject to this restriction, since they are created by the second-level-domain _owner_, not by the registrar (and the owner's web server may even handle them by transparent internal redirection on the fly, without a separate DNS record for each subdomain).
Comment 2 Glenn Adams 2011-12-17 14:40:13 UTC
(In reply to comment #0)
> Host names in the wild can contain underscores, and most software seems to cope
> just fine with them. I discovered this problem when someone had problems
> submitting such a URL to Reddit
> <http://www.reddit.com/r/boston/comments/neb4h/boston_hockey_player_didnt_get_kicked_from_the/c38emxu>

there's no underscore in the hostname of this url
Comment 3 Brian Campbell 2011-12-17 16:14:04 UTC
(In reply to comment #1)
> Maybe the underscore character should at least be allowed in _sub_domain names
> (foo_bar.example.com), since such subdomains do indeed work in the real world.
> 
> Domain registrars usually do not allow underscores in second-level domains
> (foo_bar.com), but _sub_domains are _not_ subject to this restriction, since
> they are created by the second-level-domain _owner_, not by the registrar (and
> the owner's web server may even handle them by transparent internal
> redirection on the fly, without a separate DNS record for each subdomain).

Maybe. If you check the Alexa Top Million Sites CSV, you see several second-level domains with underscores. However, none of them actually resolve, as far as I can tell, so they are most likely just junk data in Alexa's dataset. Subdomains with underscores do actually work in practice, however. I have yet to see a working second-level domain that includes an underscore.

I am not sure that this restriction should be specified in HTML5, however. If it's merely a registrar policy, it could change in the future. Also, distinguishing between registered domains and subdomains is hard, given cases like .co.uk. I would just as soon leave that part up to the registrars.
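A small sketch of why the distinction is hard: the obvious heuristic of treating the last two labels as the registered domain breaks on suffixes like .co.uk. The function name naive_registered_domain is made up for this example, and the heuristic is shown precisely because it is flawed. Getting this right would need a maintained suffix list (the Public Suffix List is one such list), which is more than a spec-level character rule should probably depend on.

```python
def naive_registered_domain(host):
    """Guess the registered domain as the last two labels (a flawed heuristic)."""
    return ".".join(host.split(".")[-2:])


# Works when the public suffix is a single label such as "com".
print(naive_registered_domain("foo_bar.example.com"))    # example.com

# Fails for multi-label suffixes: "co.uk" is a suffix, not a registration,
# so the actual registered domain here would be "example.co.uk".
print(naive_registered_domain("foo_bar.example.co.uk"))  # co.uk
```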
Comment 4 Brian Campbell 2011-12-17 16:15:53 UTC
(In reply to comment #2)
> (In reply to comment #0)
> > Host names in the wild can contain underscores, and most software seems to cope
> > just fine with them. I discovered this problem when someone had problems
> > submitting such a URL to Reddit
> > <http://www.reddit.com/r/boston/comments/neb4h/boston_hockey_player_didnt_get_kicked_from_the/c38emxu>
> 
> there's no underscore in the hostname of this url

That was a link to the discussion about the URL with the underscore in the host. If you follow that link, and then the story it points to, you will see the URL under discussion:

http://neshl_mboston.stats.pointstreak.com/playerpage.html?playerid=5186057&seasonid=7647
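For anyone who wants to confirm where the underscore sits, here is a quick sketch using Python's standard urllib.parse, which splits the URL without enforcing any host name character rules. The variable names are just for illustration.

```python
from urllib.parse import urlsplit

url = (
    "http://neshl_mboston.stats.pointstreak.com/playerpage.html"
    "?playerid=5186057&seasonid=7647"
)

# urlsplit only separates the URL into components; it does not apply
# the STD 3 host name restrictions, so the underscore survives intact.
host = urlsplit(url).hostname
labels_with_underscore = [label for label in host.split(".") if "_" in label]

print(host)                    # neshl_mboston.stats.pointstreak.com
print(labels_with_underscore)  # ['neshl_mboston']
```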
Comment 5 Anne 2012-09-28 20:16:17 UTC
Is the underscore the only additional character? I think there might be others too. E.g. some browsers support ";" (I read) and probably more as long as the DNS entry is there... 

The real problem with host names I have at the moment is figuring out which algorithm is actually run on them before the result is passed to the network layer.
Comment 6 Anne 2012-11-24 16:01:07 UTC
FWIW, the current plan is to require implementations to support any character in the ASCII range and not put any limitations there. We might encourage/require that people do not use the full range though as it seems not all systems work the same way.
Comment 7 Anne 2012-12-21 14:30:08 UTC

*** This bug has been marked as a duplicate of bug 18910 ***