<?xml version="1.0"?>
<!DOCTYPE rfc SYSTEM "rfc2629.dtd">
<rfc ipr="full3978" docName="draft-connolly-href-00">
 
 
  <front>
   <title>Web addresses in HTML 5</title>

   
    
    <author fullname="Dan Connolly" initials="D" surname="Connolly"><organization>Midwest Web Sense LLC and W3C</organization><address><email>connolly@w3.org</email><uri>http://www.w3.org/People/Connolly/</uri></address></author>
    <author fullname="C. M. Sperberg-McQueen" initials="C. M." surname="Sperberg-McQueen"><organization>Black Mesa Technologies LLC</organization><address><email>cmsmcq@blackmesatech.com</email><uri>http://www.blackmesatech.com/who/cmsmcq/</uri></address></author>
   

   <date day="21" month="May" year="2009"/>

   

  <abstract><t>This specification defines the handling of Web addresses
    for Hypertext Markup Language (HTML) 5, the fifth major revision 
    of the core language of the World Wide Web. 
    In this version, special attention has been
    given to defining clear conformance criteria for user agents in an
    effort to improve interoperability.</t></abstract></front>

  

  

  
  
  

  <middle>

   <section title="Introduction" anchor="intro">
   <t>This specification defines the term <xref target="url">Web address</xref> and
    various algorithms for processing Web addresses. These definitions are
    needed because, for historical reasons, the rules defined by the URI and
    IRI specifications do not completely describe what HTML user agents need
    to implement to be compatible with Web content.</t>
   </section>
   
   <section title="Terminology" anchor="url">

    

    <t>A <spanx style="emph">Web address</spanx> is a string used to identify a resource.</t>
    
    <t>The term "Web address" in this specification is
     used to include not only Uniform Resource Identifiers (URIs) as
     they are defined by <xref target="ref-RFC3986">RFC 3986</xref> and
     Internationalized Resource Identifiers (IRIs) as they are defined
     by <xref target="ref-RFC3987">RFC 3987</xref>, but also other strings of
     characters which can be used to identify Web resources when
     processed appropriately.  
    </t>

    <t>
    A <xref target="url">Web address</xref> is a <spanx style="emph">valid
      Web address</spanx> if at least one of the following conditions
     holds:

    <list style="symbols">

     <t>The <xref target="url">Web address</xref> is a valid URI
       reference (i.e. it matches the grammar for &lt;URI-reference&gt; given 
       in <xref target="ref-RFC3986">RFC 3986</xref>).</t>
     
     <t>The <xref target="url">Web address</xref> is a valid IRI reference 
       (i.e. it matches the grammar for &lt;IRI-reference&gt; given 
       in <xref target="ref-RFC3987">RFC 3987</xref>), and it has no
       query component. </t>
     
     <t>The <xref target="url">Web address</xref> is a valid IRI
       reference and its query component contains no unescaped non-ASCII
       characters <xref target="ref-RFC3987">[RFC3987]</xref>.</t>
     
     <t>The <xref target="url">Web address</xref> is a valid IRI
       reference and the character encoding of the
       Web address's <spanx style="verb">Document</spanx> is UTF-8 or UTF-16 <xref target="ref-RFC3987">[RFC3987]</xref>.</t>
     
    </list>
    </t>

    <t>
    A <xref target="url">Web address</xref> has an associated <spanx style="emph">Web address character encoding</spanx>, determined
     as follows:
    
    <list style="hanging"><t hangText="If the Web address came from a script (e.g. as an argument to a method)">The Web address character encoding is the script's character
       encoding.</t><t hangText="If the Web address came from a DOM node (e.g. from an element)">The node has a <spanx style="verb">Document</spanx>, and the Web address character
      encoding is the document's character encoding.</t><t hangText="If the Web address had a character encoding defined when the Web address was created or defined">The Web address character encoding is as defined.</t></list>
    </t>
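    <figure>
     <preamble>The three cases above can be sketched as a simple lookup.
      This is a minimal illustration, not part of the specification: the
      AddressSource record and its field names are hypothetical, and the
      sketch assumes that exactly one of the three cases applies to a
      given Web address.</preamble>
     <artwork><![CDATA[
```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AddressSource:
    """Hypothetical record of where a Web address came from."""
    script_encoding: Optional[str] = None    # address came from a script
    document_encoding: Optional[str] = None  # address came from a DOM node
    declared_encoding: Optional[str] = None  # encoding defined at creation


def address_character_encoding(source: AddressSource) -> str:
    """Return the character encoding associated with a Web address."""
    if source.script_encoding is not None:
        return source.script_encoding
    if source.document_encoding is not None:
        # The node has a Document; use that document's encoding.
        return source.document_encoding
    if source.declared_encoding is not None:
        return source.declared_encoding
    raise ValueError("no character encoding is associated with this address")
```
]]></artwork>
    </figure>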
   </section>
   <section title="Parsing Web addresses" anchor="parsing-urls">
    
    

    <t>
      To <spanx style="emph">parse a Web address</spanx> 
     <spanx style="emph">w</spanx> into its
     component parts, the user agent must use the following steps:
    
    <list style="numbers">
     <t>Strip leading and trailing space characters from <spanx style="emph">w</spanx>.</t>

     <t>Percent-encode all non-URI characters
       in <spanx style="emph">w</spanx>.

       


      Note: step 2 replaces each of the following
      characters with its percent-encoded equivalent:
      <list style="symbols">
       <t>all characters with code points less than or equal to U+0020
       (i.e. the C0 control characters and U+0020 SPACE)</t>
       <t>all characters with code points greater than or equal to U+007F
       (i.e. U+007F DELETE and all non-ASCII characters in <spanx style="emph">w</spanx>)</t>
       <t>U+0022 double quotation mark</t>
       <t>U+0025 percent sign</t>
       <t>U+003C less-than sign</t>
       <t>U+003E greater-than sign</t>
       <t>U+005C reverse solidus (backslash)</t>
       <t>U+005E circumflex accent</t>
       <t>U+0060 grave accent</t>
       <t>U+007B left curly bracket</t>
       <t>U+007C vertical line</t>
       <t>U+007D right curly bracket</t>
      </list>

     </t>

     <t>If <spanx style="emph">w</spanx> begins with either of:
      <list style="symbols">
       <t>a string matching the &lt;scheme&gt; production,
       followed by "<spanx style="verb">://</spanx>"</t>
       <t>the string "//"</t>
      </list>
      then percent-encode any left or right square brackets
       (U+005B, U+005D, "<spanx style="verb">[</spanx>" and "<spanx style="verb">]</spanx>")
       following the first occurrence of "<spanx style="verb">/</spanx>",
       "<spanx style="verb">?</spanx>", or
       "<spanx style="verb">#</spanx>" which <spanx style="emph">follows</spanx> the
       first occurrence of "<spanx style="verb">//</spanx>".</t>
      <t>Otherwise, percent-encode all left and right square brackets.</t>
     
     
      <t>Percent-encode all occurrences of U+0023 (Number sign, "<spanx style="verb">#</spanx>")
       after the first.</t>
     
     
     
      
      <t>Parse <spanx style="emph">w</spanx> using the grammar in 
       <xref target="ref-RFC3986">RFC 3986</xref>.</t>
     
     
     

      <t>If <spanx style="emph">w</spanx> doesn't match the
       &lt;URI-reference&gt; production, even after the above changes are
       made to it, then parsing the Web address fails with an
       error. <xref target="ref-RFC3986">[RFC3986]</xref></t>
      
      <t>Otherwise, parsing <spanx style="emph">w</spanx> was successful; the
       components of the Web address are substrings of <spanx style="emph">w</spanx>
       defined as follows.  First, the substring of the modified <spanx style="emph">w</spanx>
       which matched a particular production in 
       <xref target="ref-RFC3986">RFC 3986</xref> is identified; then any 
       percent-encoded characters in that substring are decoded.
       The resulting string (called here the "decoded substring") 
       is one of the named components of <spanx style="emph">w</spanx>.

      As a result of percent-encoding the percent sign, any
       occurrences of percent-encoding in the Web address will be
       double-encoded at this step.

      

      <list style="hanging"><t anchor="url-scheme" hangText="&lt;scheme&gt;">The decoded substring matched by the &lt;scheme&gt; production, if any.</t><t anchor="url-host" hangText="&lt;host&gt;">The decoded substring matched by the &lt;host&gt; production, if any.</t><t anchor="url-port" hangText="&lt;port&gt;">The decoded substring matched by the &lt;port&gt; production, if any.</t><t anchor="url-hostport" hangText="&lt;hostport&gt;">If there is a &lt;scheme&gt; component and a &lt;port&gt;
	 component and the port given by the &lt;port&gt; component is
	 different from the default port defined for the protocol given by
	 the &lt;scheme&gt; component, then &lt;hostport&gt; is the
	 decoded substring that starts with the decoded substring matched by the
	 &lt;host&gt; production and ends with the decoded substring matched by the
	 &lt;port&gt; production, and includes the colon in between the
	 two. Otherwise, it is the same as the &lt;host&gt; component.</t><t anchor="url-path" hangText="&lt;path&gt;">
	
	The decoded substring matched by one of the following productions, if
	 one of them was matched:
	
	<list style="symbols"><t>&lt;path-abempty&gt;</t>
	 <t>&lt;path-absolute&gt;</t>
	 <t>&lt;path-noscheme&gt;</t>
	 <t>&lt;path-rootless&gt;</t>
	 <t>&lt;path-empty&gt;</t>
	</list></t><t anchor="url-query" hangText="&lt;query&gt;">The decoded substring matched by the &lt;query&gt; production, if any.</t><t anchor="url-fragment" hangText="&lt;fragment&gt;">The decoded substring matched by the &lt;fragment&gt; production, if any.</t><t anchor="url-host-specific" hangText="&lt;host-specific&gt;">The decoded substring that <spanx style="emph">follows</spanx> the decoded substring matched
	 by the &lt;authority&gt; production, or the whole string if the
	 &lt;authority&gt; production wasn't matched.</t></list>
      </t>
    
     
    </list>
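    <figure>
     <preamble>The pre-processing steps (stripping, percent-encoding, and
      the bracket and number-sign rules) might be sketched as follows. This
      is an illustration under stated assumptions, not a normative
      implementation: the function names are hypothetical, "space
      characters" is assumed to mean space, tab, LF, FF and CR, and
      non-ASCII characters are assumed to be percent-encoded via their
      UTF-8 octets, which the steps above do not state.</preamble>
     <artwork><![CDATA[
```python
import re

SPACE_CHARACTERS = " \t\n\f\r"       # assumption: HTML 5 "space characters"
ALWAYS_ESCAPED = set('"%<>\\^`{|}')  # the individually listed characters
# A <scheme> followed by "://", or a bare "//", at the start of the string.
AUTHORITY_PREFIX = re.compile(r"(?:[A-Za-z][A-Za-z0-9+.-]*:)?//")


def pct(ch: str) -> str:
    """Percent-encode one character via UTF-8 (UTF-8 is an assumption)."""
    return "".join(f"%{octet:02X}" for octet in ch.encode("utf-8"))


def preprocess(w: str) -> str:
    # Strip leading and trailing space characters.
    w = w.strip(SPACE_CHARACTERS)
    # Percent-encode all non-URI characters.
    w = "".join(
        pct(ch)
        if ord(ch) <= 0x20 or ord(ch) >= 0x7F or ch in ALWAYS_ESCAPED
        else ch
        for ch in w
    )
    # Square brackets survive only before the first "/", "?" or "#"
    # that follows a leading "//"; everywhere else they are encoded.
    prefix = AUTHORITY_PREFIX.match(w)
    if prefix:
        tail = re.search(r"[/?#]", w[prefix.end():])
        keep = prefix.end() + tail.start() if tail else len(w)
    else:
        keep = 0
    w = w[:keep] + w[keep:].replace("[", "%5B").replace("]", "%5D")
    # Percent-encode every "#" after the first.
    first_hash = w.find("#")
    if first_hash != -1:
        head, rest = w[:first_hash + 1], w[first_hash + 1:]
        w = head + rest.replace("#", "%23")
    return w
```
]]></artwork>
     <postamble>Under these assumptions, the input
      " http://[::1]/a[b]#c#d " becomes
      "http://[::1]/a%5Bb%5D#c%23d": the brackets of the IPv6 literal are
      kept, those in the path are encoded, and only the first number sign
      survives.</postamble>
    </figure>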
    


    
    
    
    
  </t>
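  <figure>
   <preamble>The named components can be illustrated with the
    component-splitting regular expression given in Appendix B of RFC 3986.
    This is a sketch only: that expression splits a string but does not
    validate it against the &lt;URI-reference&gt; production, so the failure
    case is not covered. The function name, the AUTHORITY helper
    expression and the DEFAULT_PORTS table (an illustrative subset) are
    assumptions made for this example.</preamble>
   <artwork><![CDATA[
```python
import re
from urllib.parse import unquote

# Component-splitting expression from RFC 3986, Appendix B.
URI_SPLIT = re.compile(
    r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?"
)
# Hypothetical helper: optional userinfo, host (or IP literal), port.
AUTHORITY = re.compile(r"^(?:[^@]*@)?(\[[^\]]*\]|[^:]*)(?::([0-9]*))?$")
DEFAULT_PORTS = {"http": "80", "https": "443", "ftp": "21"}  # subset


def components(w: str) -> dict:
    """Split an already pre-processed Web address into decoded parts."""
    m = URI_SPLIT.match(w)  # this expression matches any string
    scheme, authority, path, query, fragment = m.group(2, 4, 5, 7, 9)
    host = port = hostport = None
    if authority is not None:
        am = AUTHORITY.match(authority)
        host, port = am.group(1, 2) if am else (authority, None)
        hostport = host
        # Include the port only when it differs from the scheme default.
        if scheme and port and DEFAULT_PORTS.get(scheme.lower()) != port:
            hostport = f"{host}:{port}"
    # Everything after the authority, or the whole string without one.
    host_specific = w if authority is None else w[m.end(3):]
    raw = {
        "scheme": scheme, "host": host, "port": port,
        "hostport": hostport, "path": path, "query": query,
        "fragment": fragment, "host-specific": host_specific,
    }
    # Decode percent-encoded characters in each component present.
    return {k: None if v is None else unquote(v) for k, v in raw.items()}
```
]]></artwork>
   <postamble>For "http://example.com:8080/a%20b?x=1#top" this sketch
    reports a &lt;hostport&gt; of "example.com:8080" and a &lt;path&gt; of
    "/a b"; for "http://example.com:80/" the &lt;hostport&gt; is just
    "example.com", because 80 is the default port assumed for
    http.</postamble>
  </figure>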

  </section>

   <section title="Resolving Web addresses" anchor="absolute-url">
    
    
    <t>
      To <spanx style="emph">resolve a Web address</spanx> to an 
     absolute Web address
     relative to either another absolute Web address 
     or an element,
     the user agent must use the following steps. Resolving a Web address can
     result in an error, in which case the Web address is not resolvable.
    
    <list style="numbers">
     <t>Let <spanx style="emph">w</spanx> be the Web address being
       resolved.</t>
     
     <t>Let <spanx style="emph">encoding</spanx> be the character
	encoding of the Web address.</t>
     
     <t>If <spanx style="emph">encoding</spanx> is UTF-16, then change it to
       UTF-8.</t>
     
     
      
      <t>If the algorithm was invoked with an absolute Web address
       to use as the base Web address, let <spanx style="emph">base</spanx> be that
       absolute Web address.</t>
      
      <t>Otherwise, let <spanx style="emph">base</spanx> be the <spanx style="emph">base URI of
	the element</spanx>, as defined by the XML Base specification, with
       <spanx style="emph">the base URI of the document entity</spanx> being defined as the
       document base Web address of the <spanx style="verb">Document</spanx> that
       owns the element. <xref target="ref-XMLBase">[XMLBASE]</xref></t>
      
      <t>For the purposes of the XML Base specification, user agents
       must act as if all <spanx style="verb">Document</spanx> objects represented XML
       documents.</t>
      
      <t>It is possible for <spanx style="verb">xml:base</spanx> attributes to be present
       even in HTML fragments, as such attributes can be added
       dynamically using script. (Such scripts would not be conforming,
       however, as <spanx style="verb">xml:base</spanx> attributes
       are not allowed in HTML documents.)</t>

      <t>
      The <spanx style="emph">document base Web address</spanx> of a <spanx style="verb">Document</spanx> is
       the absolute Web address obtained by running these
       substeps:
      
       
      <list style="numbers"><t>Let <spanx style="emph">fallback base url</spanx> be the
	  document's address.</t>
       
       
	
	

	

	<t>If <spanx style="emph">fallback base url</spanx> is
	 <spanx style="verb">about:blank</spanx>, and the <spanx style="verb">Document</spanx>'s
	 browsing context has a creator browsing
	  context, then let <spanx style="emph">fallback base url</spanx>
	 be the document base Web address of the creator
	  <spanx style="verb">Document</spanx> instead.</t>
	
       
       
       <t>If there is no <spanx style="verb">base</spanx> element that is both a
	 child of the <spanx style="verb">head</spanx> element and has an
	 <spanx style="verb">href</spanx> attribute, then the
	 document base Web address is <spanx style="emph">fallback base
	  url</spanx>.</t>
       
       <t>Otherwise, let <spanx style="emph">w</spanx> be the value of the
	 <spanx style="verb">href</spanx> attribute of the first
	 such element.</t>
       
       <t>Resolve <spanx style="emph">w</spanx> relative to <spanx style="emph">fallback base
	  url</spanx> (thus, the <spanx style="verb">base</spanx> <spanx style="verb">href</spanx> attribute isn't affected by
	 <spanx style="verb">xml:base</spanx> attributes).</t>
       
       <t>The document base Web address is the result of the
	 previous step if it was successful; otherwise it is <spanx style="emph">fallback base url</spanx>.</t>
       
      </list>
      </t>
     
     
     <t>Parse <spanx style="emph">w</spanx> into its component parts.</t>
     
     
      
      <t>If parsing <spanx style="emph">w</spanx> resulted in a &lt;host&gt; component, then replace the
       matching substring of <spanx style="emph">w</spanx> with the string that
       results from expanding any sequences of percent-encoded octets in
       that component that are valid UTF-8 sequences into Unicode
       characters as defined by UTF-8.</t>
      
      <t>If any percent-encoded octets in that component are not valid
       UTF-8 sequences, then return an error and abort these steps.</t>
      
      <t>Apply the IDNA ToASCII algorithm to the matching substring,
       with both the AllowUnassigned and UseSTD3ASCIIRules flags
       set. Replace the matching substring with the result of the ToASCII
       algorithm.</t>
      
      <t>If ToASCII fails to convert one of the components of the
       string, e.g. because it is too long or because it contains invalid
       characters, then return an error and abort these steps <xref target="ref-RFC3490">[RFC3490]</xref>.</t>
      
     
     
     <t>
      If parsing <spanx style="emph">w</spanx> resulted in a &lt;path&gt; component, then replace the
       matching substring of <spanx style="emph">w</spanx> with the string that
       results from applying the following steps to each character other
       than U+0025 PERCENT SIGN (%) that doesn't match the original
       &lt;path&gt; production defined in RFC 3986:
      
      <list style="numbers"><t>Encode the character into a sequence of octets as defined by
	UTF-8.</t>
       
       <t>Replace the character with the percent-encoded form of those
	octets. <xref target="ref-RFC3986">[RFC3986]</xref></t>
       
      </list>

      For instance, if <spanx style="emph">w</spanx> was "<spanx style="verb">//example.com/a^b☺c%FFd%z/?e</spanx>", then the
	&lt;path&gt; component's substring
	would be "<spanx style="verb">/a^b☺c%FFd%z/</spanx>" and the two
	characters that would have to be escaped would be "<spanx style="verb">^</spanx>" and "<spanx style="verb">☺</spanx>". The
	result after this step was applied would therefore be that <spanx style="emph">w</spanx> now had the value "<spanx style="verb">//example.com/a%5Eb%E2%98%BAc%FFd%z/?e</spanx>".
      
     </t>
     
     <t>
      
	If parsing <spanx style="emph">w</spanx> resulted in a &lt;query&gt; component, then replace the
       matching substring of <spanx style="emph">w</spanx> with the string that
       results from applying the following steps to each character other
       than U+0025 PERCENT SIGN (%) that doesn't match the original
       &lt;query&gt; production defined in RFC 3986:
      
      <list style="numbers"><t>If the character in question cannot be expressed in the
	encoding <spanx style="emph">encoding</spanx>, then replace it with a
	single 0x3F octet (an ASCII question mark) and skip the remaining
	substeps for this character.</t>
       
       <t>Encode the character into a sequence of octets as defined by
	the encoding <spanx style="emph">encoding</spanx>.</t>
       
       <t>Replace the character with the percent-encoded form of those
	octets. <xref target="ref-RFC3986">[RFC3986]</xref></t>
       
      </list>
     </t>
     
     <t>Apply the algorithm described in RFC 3986 section 5.2
       Relative Resolution, using <spanx style="emph">w</spanx> as the
       potentially relative URI reference (<spanx style="emph">R</spanx>), and
       <spanx style="emph">base</spanx> as the base URI (<spanx style="emph">Base</spanx>). <xref target="ref-RFC3986">[RFC3986]</xref></t>
     
     
      
      <t>Apply any relevant conformance criteria of RFC 3986 and RFC
       3987, returning an error and aborting these steps if
       appropriate. <xref target="ref-RFC3986">[RFC3986]</xref> <xref target="ref-RFC3987">[RFC3987]</xref></t>
      
      <t>For instance, if an absolute URI that would be
       returned by the above algorithm violates the restrictions specific
       to its scheme, e.g. a <spanx style="verb">data:</spanx> URI using the
       "<spanx style="verb">//</spanx>" server-based naming authority syntax,
       then user agents are to treat this as an error instead.</t>
      
     
     
     <t>Let <spanx style="emph">result</spanx> be the target URI (<spanx style="emph">T</spanx>) returned by the Relative Resolution
       algorithm.</t>
     
     <t>If <spanx style="emph">result</spanx> uses a scheme with a
       server-based naming authority, replace all U+005C REVERSE SOLIDUS
       (\) characters in <spanx style="emph">result</spanx> with U+002F SOLIDUS
       (/) characters.</t>
     
     <t>Return <spanx style="emph">result</spanx>.</t>
     
    </list>
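    <figure>
     <preamble>The &lt;host&gt; handling in the steps above (expand
      percent-encoded UTF-8, then apply ToASCII) can be approximated as
      follows. This is a sketch under an explicit assumption: Python's
      built-in "idna" codec is used as a stand-in for ToASCII, and it does
      not expose the AllowUnassigned and UseSTD3ASCIIRules flags, so its
      behaviour may differ from what the steps require. The function name
      is hypothetical.</preamble>
     <artwork><![CDATA[
```python
from urllib.parse import unquote


def host_to_ascii(host: str) -> str:
    """Expand percent-encoded UTF-8 in a host, then convert it to ASCII."""
    try:
        # errors="strict" rejects octet sequences that are not valid UTF-8.
        expanded = unquote(host, encoding="utf-8", errors="strict")
        # Stand-in for IDNA ToASCII (flags are not configurable here).
        return expanded.encode("idna").decode("ascii")
    except UnicodeError as exc:
        # Covers both invalid UTF-8 and labels that ToASCII rejects.
        raise ValueError(f"host {host!r} is not resolvable") from exc
```
]]></artwork>
     <postamble>With this stand-in, "b%C3%BCcher.example" is converted to
      "xn--bcher-kva.example", while "%FF.example" is rejected because
      the octet FF is not valid UTF-8.</postamble>
    </figure>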
    </t>
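    <figure>
     <preamble>The &lt;path&gt; and &lt;query&gt; escaping steps differ only
      in the encoding used and in the fallback for unencodable characters.
      The following sketch illustrates that difference. The function names
      are hypothetical, and the sets of permitted characters are
      transcribed from the pchar and query rules of RFC 3986 rather than
      taken from this specification. A caller is assumed to have already
      replaced UTF-16 with UTF-8, as the earlier step requires.</preamble>
     <artwork><![CDATA[
```python
import string

UNRESERVED = set(string.ascii_letters + string.digits + "-._~")
SUB_DELIMS = set("!$&'()*+,;=")
PATH_OK = UNRESERVED | SUB_DELIMS | set(":@/")  # pchar plus "/"
QUERY_OK = PATH_OK | set("?")                   # query also allows "?"


def escape_path(path: str) -> str:
    """Percent-encode disallowed path characters via UTF-8."""
    out = []
    for ch in path:
        if ch == "%" or ch in PATH_OK:
            out.append(ch)  # percent signs are left untouched
        else:
            out.extend(f"%{o:02X}" for o in ch.encode("utf-8"))
    return "".join(out)


def escape_query(query: str, encoding: str) -> str:
    """Percent-encode disallowed query characters via `encoding`."""
    out = []
    for ch in query:
        if ch == "%" or ch in QUERY_OK:
            out.append(ch)
            continue
        try:
            octets = ch.encode(encoding)
        except UnicodeEncodeError:
            out.append("?")  # unencodable: a single 0x3F octet
            continue
        out.extend(f"%{o:02X}" for o in octets)
    return "".join(out)
```
]]></artwork>
     <postamble>Applied to the path from the example above,
      "/a^b☺c%FFd%z/", escape_path returns "/a%5Eb%E2%98%BAc%FFd%z/". For
      the query "q=☺", escape_query returns "q=%E2%98%BA" with UTF-8 and
      "q=?" with ISO-8859-1, which cannot express that
      character.</postamble>
    </figure>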

    <t>A <xref target="url">Web address</xref> is an <spanx style="emph">absolute Web address</spanx> if resolving it results in the same
     Web address without an error.
     
    </t>
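    <figure>
     <preamble>The final steps (relative resolution, backslash replacement
      and the test for an absolute Web address) can be sketched with
      Python's urllib.parse.urljoin, which is that library's approximation
      of RFC 3986 relative resolution and may differ from section 5.2 in
      edge cases. The sketch omits the parsing and escaping steps. The
      function names and the SERVER_BASED set of schemes are assumptions
      made for illustration.</preamble>
     <artwork><![CDATA[
```python
from urllib.parse import urljoin, urlsplit

# Assumption: an illustrative subset of schemes that use a
# server-based naming authority.
SERVER_BASED = {"http", "https", "ftp"}


def resolve(w: str, base: str) -> str:
    """Resolve w against an absolute base (escaping steps omitted)."""
    result = urljoin(base, w)
    if urlsplit(result).scheme in SERVER_BASED:
        # Replace U+005C REVERSE SOLIDUS with U+002F SOLIDUS.
        result = result.replace("\\", "/")
    return result


def is_absolute(w: str, base: str) -> bool:
    """True if resolving w yields w itself."""
    return resolve(w, base) == w
```
]]></artwork>
     <postamble>With the base "http://example.com/a/b", resolving "../c"
      gives "http://example.com/c". That result is itself absolute under
      this test, whereas "../c" is not.</postamble>
    </figure>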
    
   </section>

  </middle>



  <back>
   <references>
    <reference anchor="ref-RFC3490" target="http://www.ietf.org/rfc/rfc3490.txt"><front><title>Internationalizing Domain Names in Applications (IDNA)</title><author fullname="P. Faltstrom, P. Hoffman, and A. Costello"><organization/></author><date month="March" year="2003"/></front><seriesInfo name="RFC" value="3490"/></reference>
    <reference anchor="ref-RFC3986" target="http://www.ietf.org/rfc/rfc3986.txt"><front><title>Uniform Resource Identifier (URI): Generic Syntax</title><author fullname="T. Berners-Lee, R. Fielding, and L. Masinter"><organization/></author><date month="January" year="2005"/></front><seriesInfo name="RFC" value="3986"/></reference>
    <reference anchor="ref-RFC3987" target="http://www.ietf.org/rfc/rfc3987.txt"><front><title>Internationalized Resource Identifiers (IRIs)</title><author fullname="M. Duerst and M. Suignard"><organization/></author><date month="January" year="2005"/></front><seriesInfo name="RFC" value="3987"/></reference>
    <reference anchor="ref-XMLBase" target="http://www.w3.org/TR/xmlbase/"><front><title>XML Base (Second Edition)</title><author fullname="Jonathan Marsh and Richard Tobin, ed."><organization/></author><date month="" year=""/></front></reference>
   </references>


  </back>
 
</rfc>

