This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019. Please see the home page for more details.
Step 6 of section 2.6.3, resolving URLs <http://www.w3.org/TR/html5/urls.html#resolving-urls>, requires that the ToASCII algorithm of IDNA 2003 (RFC 3490, http://tools.ietf.org/html/rfc3490) be called with the UseSTD3ASCIIRules flag set. That flag says that the host name rules specified in STD 3 (RFC 1122 and RFC 1123) should be enforced. This means that host name labels may contain only letters, digits, and hyphens, and may not begin or end with a hyphen. Host names in the wild can contain underscores, and most software seems to cope just fine with them. I discovered this problem when someone had problems submitting such a URL to Reddit <http://www.reddit.com/r/boston/comments/neb4h/boston_hockey_player_didnt_get_kicked_from_the/c38emxu>, which enforces the host name restriction. However, none of the browsers I tried (Firefox, Chrome, Safari, and Opera, all on Mac OS X 10.7.2) implemented this restriction; that host name works fine in all of them. I've checked the Alexa Top Million Sites <http://s3.amazonaws.com/alexa-static/top-1m.csv.zip> and found over a dozen hosts that contain underscores in their names. I would recommend relaxing the UseSTD3ASCIIRules restriction, by a willful violation of RFC 3490 (or its successor, RFC 5891 <http://tools.ietf.org/html/rfc5891>, if that is ever used), to allow the underscore in the same places that a hyphen is allowed.
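To make the proposal in comment #0 concrete, here is a minimal sketch of the label check, assuming the host name is already pure ASCII (no IDNA conversion is attempted). The names `STRICT_LABEL`, `RELAXED_LABEL`, and `is_valid_host` are made up for illustration and do not come from any specification or browser. The relaxed pattern reads "the same places that a hyphen is allowed" literally, so an underscore is accepted inside a label but not as its first or last character.

```python
import re

# Strict rule (what UseSTD3ASCIIRules enforces on an ASCII label):
# letters, digits and hyphens only, no leading or trailing hyphen.
# The 1..63 length bound is the usual DNS label length limit.
STRICT_LABEL = re.compile(r"(?!-)[A-Za-z0-9-]{1,63}(?<!-)")

# Relaxed rule (illustrative reading of the proposal): an underscore is
# allowed wherever a hyphen is, i.e. not as the first or last character.
RELAXED_LABEL = re.compile(r"(?![-_])[A-Za-z0-9_-]{1,63}(?<![-_])")


def is_valid_host(host, allow_underscore=False):
    """Return True if every dot-separated label of `host` passes the check."""
    pattern = RELAXED_LABEL if allow_underscore else STRICT_LABEL
    return all(pattern.fullmatch(label) is not None for label in host.split("."))


if __name__ == "__main__":
    host = "neshl_mboston.stats.pointstreak.com"
    print(is_valid_host(host))                         # strict check
    print(is_valid_host(host, allow_underscore=True))  # relaxed check
```

Under the strict pattern the host name from the Reddit discussion is rejected; under the relaxed pattern it is accepted, while labels that start or end with a hyphen or underscore are still refused.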
Maybe the underscore character should at least be allowed in subdomain names (foo_bar.example.com), since such subdomains do work in the real world. Domain registrars usually do not allow underscores in second-level domains (foo_bar.com), but subdomains are not subject to this restriction, since they are created by the owner of the second-level domain, not by the registrar. (The owner can even have the web server redirect subdomains transparently on the fly, without assigning a DNS record to each subdomain individually.)
(In reply to comment #0)
> Host names in the wild can contain underscores, and most software seems to cope
> just fine with them. I discovered this problem when someone had problems
> submitting such a URL to Reddit
> <http://www.reddit.com/r/boston/comments/neb4h/boston_hockey_player_didnt_get_kicked_from_the/c38emxu>
there's no underscore in the hostname of this url
(In reply to comment #1)
> Maybe the underscore character should at least be allowed in subdomain names
> (foo_bar.example.com), since such subdomains do work in the real world.
>
> Domain registrars usually do not allow underscores in second-level
> domains (foo_bar.com), but subdomains are not subject to this restriction,
> since they are created by the owner of the second-level domain, not by the
> registrar. (The owner can even have the web server redirect subdomains
> transparently on the fly, without assigning a DNS record to each subdomain
> individually.)
Maybe. If you check the Alexa top million sites CSV, you see several second-level domains with underscores. However, none of them actually resolve, as far as I can tell, so they are most likely just junk data in Alexa's dataset. Subdomains with underscores do work in practice, however; I have yet to see a working second-level domain that includes an underscore. I am not sure that this restriction should be specified in HTML5, though. If it's merely a registrar policy, it could change in the future. Also, distinguishing between registered domains and subdomains is hard, given cases like .co.uk. I would just as soon leave that part up to the registrars.
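The difficulty with .co.uk mentioned above can be sketched as follows. This is only an illustration: it assumes the simplest possible heuristic (treat the last two labels as the registered domain), and the function names and the host foo_bar.co.uk are hypothetical.

```python
def naive_registered_domain(host):
    """Guess the registered domain as the last two labels (a deliberate simplification)."""
    labels = host.lower().split(".")
    return ".".join(labels[-2:])


def underscore_only_in_subdomain(host):
    """True if the underscores appear only to the left of the guessed registered domain."""
    host = host.lower()
    registered = naive_registered_domain(host)
    prefix = host[: len(host) - len(registered)]
    return "_" in prefix and "_" not in registered


if __name__ == "__main__":
    for name in ("foo_bar.example.com", "foo_bar.com", "foo_bar.co.uk"):
        print(name, underscore_only_in_subdomain(name))
```

The heuristic treats foo_bar.co.uk as a subdomain of "co.uk", although foo_bar would be the registered name there. A rule that permits underscores only in subdomains would therefore need a maintained list of registry suffixes, which is one reason to leave that distinction to the registrars.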
(In reply to comment #2)
> (In reply to comment #0)
> > Host names in the wild can contain underscores, and most software seems to cope
> > just fine with them. I discovered this problem when someone had problems
> > submitting such a URL to Reddit
> > <http://www.reddit.com/r/boston/comments/neb4h/boston_hockey_player_didnt_get_kicked_from_the/c38emxu>
>
> there's no underscore in the hostname of this url
That was a link to the discussion about the URL with the underscore in the host. If you follow that link, and then the story it points to, you will see the URL under discussion: http://neshl_mboston.stats.pointstreak.com/playerpage.html?playerid=5186057&seasonid=7647
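The mix-up between the two URLs is easy to check mechanically. The sketch below uses Python's standard `urllib.parse.urlsplit` to pull out the host and list the labels that contain an underscore; the helper name `labels_with_underscore` is invented for this example.

```python
from urllib.parse import urlsplit


def labels_with_underscore(url):
    """Return the host name labels of `url` that contain an underscore."""
    host = urlsplit(url).hostname or ""
    return [label for label in host.split(".") if "_" in label]


if __name__ == "__main__":
    reddit = ("http://www.reddit.com/r/boston/comments/neb4h/"
              "boston_hockey_player_didnt_get_kicked_from_the/c38emxu")
    stats = ("http://neshl_mboston.stats.pointstreak.com/"
             "playerpage.html?playerid=5186057&seasonid=7647")
    print(labels_with_underscore(reddit))  # underscores appear only in the path
    print(labels_with_underscore(stats))   # underscore appears in the first label
```

Only the host part is inspected, so the underscores in the Reddit URL's path are ignored, while the pointstreak URL reports its first label.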
Is the underscore the only additional character? I think there might be others too. For example, I have read that some browsers support ";", and probably more, as long as the DNS entry is there. The real problem I have with host names at the moment is figuring out which algorithm is actually run on them before the result is passed to the network layer.
FWIW, the current plan is to require implementations to support any character in the ASCII range and not put any limitations there. We might encourage/require that people do not use the full range though as it seems not all systems work the same way.
*** This bug has been marked as a duplicate of bug 18910 ***