<?xml version="1.0" encoding="UTF-8" standalone="yes" ?>
<!DOCTYPE bugzilla SYSTEM "https://www.w3.org/Bugs/Public/page.cgi?id=bugzilla.dtd">

<bugzilla version="5.0.4"
          urlbase="https://www.w3.org/Bugs/Public/"
          
          maintainer="sysbot+bugzilla@w3.org"
>

    <bug>
          <bug_id>15254</bug_id>
          
          <creation_ts>2011-12-17 08:17:20 +0000</creation_ts>
          <short_desc>Don&apos;t forbid underscore in host names in URLs</short_desc>
          <delta_ts>2012-12-21 14:30:08 +0000</delta_ts>
          <reporter_accessible>1</reporter_accessible>
          <cclist_accessible>1</cclist_accessible>
          <classification_id>1</classification_id>
          <classification>Unclassified</classification>
          <product>WHATWG</product>
          <component>URL</component>
          <version>unspecified</version>
          <rep_platform>PC</rep_platform>
          <op_sys>Windows 3.1</op_sys>
          <bug_status>RESOLVED</bug_status>
          <resolution>DUPLICATE</resolution>
          <dup_id>18910</dup_id>
          
          <bug_file_loc></bug_file_loc>
          <status_whiteboard></status_whiteboard>
          <keywords></keywords>
          <priority>P2</priority>
          <bug_severity>normal</bug_severity>
          <target_milestone>Unsorted</target_milestone>
          
          
          <everconfirmed>1</everconfirmed>
          <reporter name="Brian Campbell">lambda</reporter>
          <assigned_to name="Anne">annevk</assigned_to>
          <cc>annevk</cc>
    
    <cc>erik.arvidsson</cc>
    
    <cc>glenn</cc>
    
    <cc>ian</cc>
    
    <cc>mike</cc>
    
    <cc>mtanalin</cc>
    
    <cc>public-html-admin</cc>
    
    <cc>public-html-wg-issue-tracking</cc>
    
    <cc>public-webapps</cc>
          
          <qa_contact>sideshowbarker+urlspec</qa_contact>

      

      

      

          <comment_sort_order>oldest_to_newest</comment_sort_order>  
          <long_desc isprivate="0" >
    <commentid>61736</commentid>
    <comment_count>0</comment_count>
    <who name="Brian Campbell">lambda</who>
    <bug_when>2011-12-17 08:17:20 +0000</bug_when>
    <thetext>Step 6 of section 2.6.3, resolving URLs &lt;http://www.w3.org/TR/html5/urls.html#resolving-urls&gt;, requires that the ToASCII algorithm of IDNA 2003 (RFC 3490, http://tools.ietf.org/html/rfc3490) be called with the UseSTD3ASCIIRules flag set. The UseSTD3ASCIIRules flag says that the host name rules specified in STD 3 (RFC 1122 and RFC 1123) should be enforced. This means that each host name label may contain only ASCII letters, digits, and hyphens, and must not begin or end with a hyphen.
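
To make that rule concrete, here is a minimal sketch in Python of the label check as I read it. The names is_std3_label and is_std3_host and the host my_host.example.com are hypothetical, made up for illustration. This is not the ToASCII algorithm itself, and it ignores length limits and non-ASCII labels.

```python
import re

# Assumption: a simplified reading of the STD 3 label rule, not the full
# ToASCII algorithm. A label is letters, digits, and hyphens only, and it
# may not begin or end with a hyphen.
LDH_LABEL = re.compile(r"[A-Za-z0-9](?:[A-Za-z0-9-]*[A-Za-z0-9])?")


def is_std3_label(label):
    """Return True if the label passes the simplified STD 3 check."""
    return LDH_LABEL.fullmatch(label) is not None


def is_std3_host(host):
    """Return True if every dot-separated label passes the check."""
    return all(is_std3_label(label) for label in host.split("."))


if __name__ == "__main__":
    # www.w3.org passes; the hypothetical underscore host is rejected.
    for host in ("www.w3.org", "my_host.example.com"):
        print(host, is_std3_host(host))
```

Under this check, any host with an underscore in one of its labels is rejected, which is the behavior this bug is about.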

Host names in the wild can contain underscores, and most software seems to cope just fine with them. I discovered this problem when someone had problems submitting such a URL to Reddit &lt;http://www.reddit.com/r/boston/comments/neb4h/boston_hockey_player_didnt_get_kicked_from_the/c38emxu&gt;, which enforces the host name restriction. However, none of the browsers I tried (Firefox, Chrome, Safari, and Opera, all on Mac OS X 10.7.2) implemented this restriction; that host name works fine in all of them. I&apos;ve checked the Alexa Top Million Sites &lt;http://s3.amazonaws.com/alexa-static/top-1m.csv.zip&gt;, and found over a dozen hosts that contain underscores in their names.

I would recommend relaxing the UseSTD3ASCIIRules restriction, by a willful violation of RFC 3490 (or its successor, RFC 5891 &lt;http://tools.ietf.org/html/rfc5891&gt;, if that is ever used), to allow the underscore in the same places that a hyphen is allowed.</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>61747</commentid>
    <comment_count>1</comment_count>
    <who name="Marat Tanalin | tanalin.com">mtanalin</who>
    <bug_when>2011-12-17 13:52:04 +0000</bug_when>
    <thetext>Maybe the underscore character should at least be allowed in _sub_domain names (foo_bar.example.com), since such subdomains do indeed work in the real world.

Domain registrars usually do not allow underscores in second-level domains (foo_bar.com), but _sub_domains are _not_ subject to this restriction, since they are created by the second-level-domain _owner_, not by the registrar (this includes transparent internal redirection by the web server on the fly, without assigning a separate DNS record to each subdomain).</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>61748</commentid>
    <comment_count>2</comment_count>
    <who name="Glenn Adams">glenn</who>
    <bug_when>2011-12-17 14:40:13 +0000</bug_when>
    <thetext>(In reply to comment #0)
&gt; Host names in the wild can contain underscores, and most software seems to cope
&gt; just fine with them. I discovered this problem when someone had problems
&gt; submitting such a URL to Reddit
&gt; &lt;http://www.reddit.com/r/boston/comments/neb4h/boston_hockey_player_didnt_get_kicked_from_the/c38emxu&gt;

there&apos;s no underscore in the hostname of this url</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>61751</commentid>
    <comment_count>3</comment_count>
    <who name="Brian Campbell">lambda</who>
    <bug_when>2011-12-17 16:14:04 +0000</bug_when>
    <thetext>(In reply to comment #1)
&gt; Maybe the underscore character should at least be allowed in _sub_domain
&gt; names (foo_bar.example.com), since such subdomains do indeed work in the
&gt; real world.
&gt; 
&gt; Domain registrars usually do not allow underscores in second-level domains
&gt; (foo_bar.com), but _sub_domains are _not_ subject to this restriction, since
&gt; they are created by the second-level-domain _owner_, not by the registrar
&gt; (this includes transparent internal redirection by the web server on the
&gt; fly, without assigning a separate DNS record to each subdomain).

Maybe. If you check the Alexa top million sites CSV, you will see several second-level domains with underscores. However, none of them actually resolve, as far as I can tell, so they are most likely just junk data in Alexa&apos;s dataset. Subdomains with underscores do actually work in practice, however. I have yet to see a working second-level domain that includes an underscore.

I am not sure that this restriction should be specified in HTML5, however. If it&apos;s merely a registrar policy, it could change in the future. Also, distinguishing between registered domains and subdomains is hard, given cases like .co.uk. I would just as soon leave that part up to the registrars.</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>61752</commentid>
    <comment_count>4</comment_count>
    <who name="Brian Campbell">lambda</who>
    <bug_when>2011-12-17 16:15:53 +0000</bug_when>
    <thetext>(In reply to comment #2)
&gt; (In reply to comment #0)
&gt; &gt; Host names in the wild can contain underscores, and most software seems to cope
&gt; &gt; just fine with them. I discovered this problem when someone had problems
&gt; &gt; submitting such a URL to Reddit
&gt; &gt; &lt;http://www.reddit.com/r/boston/comments/neb4h/boston_hockey_player_didnt_get_kicked_from_the/c38emxu&gt;
&gt; 
&gt; there&apos;s no underscore in the hostname of this url

That was a link to the discussion about the URL with the underscore in the host. If you follow that link, and then the story it points to, you will see the URL under discussion:

http://neshl_mboston.stats.pointstreak.com/playerpage.html?playerid=5186057&amp;seasonid=7647</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>74836</commentid>
    <comment_count>5</comment_count>
    <who name="Anne">annevk</who>
    <bug_when>2012-09-28 20:16:17 +0000</bug_when>
    <thetext>Is the underscore the only additional character? I think there might be others too. For example, some browsers reportedly support &quot;;&quot;, and probably more, as long as the DNS entry is there.

The real problem with host names I have at the moment is figuring out which algorithm is actually run on them before the result is passed to the network layer.</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>78734</commentid>
    <comment_count>6</comment_count>
    <who name="Anne">annevk</who>
    <bug_when>2012-11-24 16:01:07 +0000</bug_when>
    <thetext>FWIW, the current plan is to require implementations to support any character in the ASCII range and not put any limitations there. We might encourage or require that people not use the full range, though, as it seems not all systems work the same way.</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>80480</commentid>
    <comment_count>7</comment_count>
    <who name="Anne">annevk</who>
    <bug_when>2012-12-21 14:30:08 +0000</bug_when>
    <thetext>

*** This bug has been marked as a duplicate of bug 18910 ***</thetext>
  </long_desc>
      
      

    </bug>

</bugzilla>