<?xml version="1.0" encoding="UTF-8" standalone="yes" ?>
<!DOCTYPE bugzilla SYSTEM "https://www.w3.org/Bugs/Public/page.cgi?id=bugzilla.dtd">

<bugzilla version="5.0.4"
          urlbase="https://www.w3.org/Bugs/Public/"
          
          maintainer="sysbot+bugzilla@w3.org"
>

    <bug>
          <bug_id>2457</bug_id>
          
          <creation_ts>2005-11-04 16:39:20 +0000</creation_ts>
          <short_desc>Rules for URI encoding don&apos;t match RFC 3986/3987</short_desc>
          <delta_ts>2006-11-16 18:48:29 +0000</delta_ts>
          <reporter_accessible>1</reporter_accessible>
          <cclist_accessible>1</cclist_accessible>
          <classification_id>1</classification_id>
          <classification>Unclassified</classification>
          <product>XPath / XQuery / XSLT</product>
          <component>Functions and Operators 1.0</component>
          <version>Candidate Recommendation</version>
          <rep_platform>PC</rep_platform>
          <op_sys>Windows XP</op_sys>
          <bug_status>CLOSED</bug_status>
          <resolution>FIXED</resolution>
          
          
          <bug_file_loc></bug_file_loc>
          <status_whiteboard></status_whiteboard>
          <keywords></keywords>
          <priority>P2</priority>
          <bug_severity>normal</bug_severity>
          <target_milestone>---</target_milestone>
          
          
          <everconfirmed>1</everconfirmed>
          <reporter name="Michael Kay">mike</reporter>
          <assigned_to name="Ashok Malhotra">ashok.malhotra</assigned_to>
          
          
          <qa_contact name="Mailing list for public feedback on specs from XSL and XML Query WGs">public-qt-comments</qa_contact>

      

      

      

          <comment_sort_order>oldest_to_newest</comment_sort_order>  
          <long_desc isprivate="0" >
    <commentid>7041</commentid>
    <comment_count>0</comment_count>
    <who name="Michael Kay">mike</who>
    <bug_when>2005-11-04 16:39:20 +0000</bug_when>
    <thetext>I hate bringing up this old chestnut again, but I have a nasty feeling we&apos;ve
got it wrong.

Currently encode-for-uri() does NOT escape a &quot;#&quot; sign.

This seems contrary to the purpose of the function, and inconsistent with
the treatment of other characters.

In RFC 3986 (2.2 reserved characters), we read:

      reserved    = gen-delims / sub-delims

      gen-delims  = &quot;:&quot; / &quot;/&quot; / &quot;?&quot; / &quot;#&quot; / &quot;[&quot; / &quot;]&quot; / &quot;@&quot;

      sub-delims  = &quot;!&quot; / &quot;$&quot; / &quot;&amp;&quot; / &quot;&apos;&quot; / &quot;(&quot; / &quot;)&quot;
                  / &quot;*&quot; / &quot;+&quot; / &quot;,&quot; / &quot;;&quot; / &quot;=&quot;

The spec goes on to say:

URI producing applications should percent-encode data octets that
   correspond to characters in the reserved set unless these characters
   are specifically allowed by the URI scheme to represent data in that
   component. [This basically means that sub-delims are delimiters in some
   URI schemes/contexts, and not in others.]

encode-for-uri() escapes all characters except A-Z, a-z, 0-9, and 
   
      &quot;#&quot; &quot;-&quot; &quot;_&quot; &quot;.&quot; &quot;!&quot; &quot;~&quot; &quot;*&quot; &quot;&apos;&quot; &quot;(&quot; &quot;)&quot;

This seems to come largely from RFC2396, which has (in section 2.2)

unreserved  = alphanum | mark

mark        = &quot;-&quot; | &quot;_&quot; | &quot;.&quot; | &quot;!&quot; | &quot;~&quot; | &quot;*&quot; | &quot;&apos;&quot; | &quot;(&quot; | &quot;)&quot;

the only difference being the &quot;#&quot;.

The concept of &quot;mark&quot; seems to have disappeared in 3986.

RFC 2396 then says (2.4):

Data must be escaped if it does not have a representation using an
   unreserved character

So both RFCs agree that &quot;#&quot;, if it is not used with its special purpose as a
delimiter, must be escaped.

So why don&apos;t we escape it?

The history of this is so tortuous that I really don&apos;t want to research it.
I think a lot of it has to do with the fact that RFC 2396 handled it badly.
3986 seems much clearer, and my recommendation would be that we not only add
&quot;#&quot; to the list of characters that are escaped, but that we do exactly what
3986 says, which is to escape all characters in the &quot;reserved&quot; list (both
gen-delims and sub-delims) above.
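
As a rough illustration of what that recommendation would change, here is a
small Python sketch (not spec text; the variable names are invented for this
comment). It compares the punctuation that is currently left unescaped with
the punctuation in the RFC 3986 "unreserved" set:

```python
# Punctuation that encode-for-uri() currently leaves unescaped:
# the RFC 2396 "mark" characters plus "#".
current_exempt = set("-_.!~*'()") | {"#"}

# Punctuation in the RFC 3986 "unreserved" set (letters and digits omitted).
unreserved_punct = set("-._~")

# Characters that would start being escaped under the proposal.
newly_escaped = sorted(current_exempt - unreserved_punct)
print(newly_escaped)
```

Under these assumptions the difference is "#" plus the five former mark
characters that RFC 3986 now classes as sub-delims.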

Procedurally, as RFC 3986 is dated January 2005, I think we can reasonably
argue that it was an oversight not to bring our specs into line with it for
the last call, and that it&apos;s reasonable to rectify the situation during CR.
Other WGs have been fairly interested in this question so we&apos;ll obviously
need to consult.

Note: I was alerted to the oddity of the current spec by the test results
for fn-encode-for-uri1args-1 and related tests. The Saxon implementation
currently does escape &quot;#&quot;.

Having looked at this, we should then look at the iri-to-uri() list as well.
It&apos;s hard to see any relationship between that list of characters and
RFC3986 either. In fact, the statement:

All characters are escaped other than the lower case letters a-z, the upper
case letters A-Z, the digits 0-9, the NUMBER SIGN &quot;#&quot; and HYPHEN-MINUS
(&quot;-&quot;), LOW LINE (&quot;_&quot;), FULL STOP &quot;.&quot;, EXCLAMATION MARK &quot;!&quot;, TILDE &quot;~&quot;,
ASTERISK &quot;*&quot;, APOSTROPHE &quot;&apos;&quot;, LEFT PARENTHESIS &quot;(&quot;, and RIGHT PARENTHESIS
&quot;)&quot;, SEMICOLON &quot;;&quot;, SOLIDUS &quot;/&quot;, QUESTION MARK &quot;?&quot;, COLON &quot;:&quot;, COMMERCIAL AT
&quot;@&quot;, AMPERSAND &quot;&amp;&quot;, EQUALS SIGN &quot;=&quot;, PLUS SIGN &quot;+&quot;, DOLLAR SIGN &quot;$&quot;, COMMA
&quot;,&quot;, LEFT SQUARE BRACKET &quot;[&quot;, RIGHT SQUARE BRACKET &quot;]&quot;, and the PERCENT SIGN
&quot;%&quot;.

seems equivalent to saying &quot;escape all non-ASCII characters plus (&quot;, &lt;, &gt;,
`, \, ^, and |)&quot;, which is a pretty bizarre list.

We would expect to find the spec for iri-to-uri() in RFC3987, and sure
enough, it&apos;s there. What it says is that every character in &quot;ucschar&quot; or
&quot;iprivate&quot; must be %-encoded. That&apos;s defined like this:

ucschar        = %xA0-D7FF / %xF900-FDCF / %xFDF0-FFEF
                  / %x10000-1FFFD / %x20000-2FFFD / %x30000-3FFFD
                  / %x40000-4FFFD / %x50000-5FFFD / %x60000-6FFFD
                  / %x70000-7FFFD / %x80000-8FFFD / %x90000-9FFFD
                  / %xA0000-AFFFD / %xB0000-BFFFD / %xC0000-CFFFD
                  / %xD0000-DFFFD / %xE1000-EFFFD

   iprivate       = %xE000-F8FF / %xF0000-FFFFD / %x100000-10FFFD

which is pretty much the same as saying &quot;non-ASCII characters&quot; (and thus
overlaps somewhat with escape-html-uri()).

Since we now have a function called iri-to-uri(), it would seem that it
ought to do what the IRI spec says.

Previously raised internally at 
http://lists.w3.org/Archives/Member/w3c-xsl-query/2005Oct/0044.html

See also subsequent thread.</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>7425</commentid>
    <comment_count>1</comment_count>
    <who name="Norman Walsh">Norman.Walsh</who>
    <bug_when>2005-12-13 18:45:07 +0000</bug_when>
    <thetext>Escaping the # seems like the right thing; see
http://lists.w3.org/Archives/Public/www-tag/2005Dec/0040</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>7856</commentid>
    <comment_count>2</comment_count>
    <who name="Norman Walsh">Norman.Walsh</who>
    <bug_when>2006-01-17 15:57:06 +0000</bug_when>
    <thetext>My proposal per ACTION A-282-01

fn:encode-for-uri

  fn:encode-for-uri($uri-part as xs:string?) as xs:string

Summary: This function encodes reserved characters in an xs:string
that is intended to be used in the path segment of a URI. It is
invertible but not idempotent. This function applies the URI escaping
rules defined in section 2 of [RFC 3986] to the string supplied as
$uri-part. The effect of the function is to escape reserved
characters. Each such character in the string is replaced with its
percent-encoded form as described in [RFC 3986].

If $uri-part is the empty sequence, returns the zero-length string.

All characters are escaped except those identified as &quot;unreserved&quot; by
[RFC 3986], that is the upper- and lower-case letters A-Z, the digits
0-9, HYPHEN-MINUS (&quot;-&quot;), LOW LINE (&quot;_&quot;), FULL STOP &quot;.&quot;, and TILDE &quot;~&quot;.

Note that this function escapes URI delimiters and therefore cannot be
used indiscriminately to encode &quot;invalid&quot; characters in a path
segment.

Since [RFC 3986] recommends that, for consistency, URI producers and
normalizers should use uppercase hexadecimal digits for all
percent-encodings, this function must always generate hexadecimal
values using the upper-case letters A-F.

Examples

    * fn:encode-for-uri(&quot;http://www.example.com/00/Weather/CA/Los%20Angeles#ocean&quot;) 
      returns 
&quot;http%3A%2F%2Fwww.example.com%2F00%2FWeather%2FCA%2FLos%2520Angeles#ocean&quot;.
      This is probably not what the user intended because all of the delimiters
      have been encoded.

    * concat(&quot;http://www.example.com/&quot;, encode-for-uri(&quot;~bébé&quot;))
      returns &quot;http://www.example.com/~b%C3%A9b%C3%A9&quot;.

    * concat(&quot;http://www.example.com/&quot;, encode-for-uri(&quot;100% organic&quot;))
      returns &quot;http://www.example.com/100%25%20organic&quot;.
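
For anyone wanting to check the examples, the rule above can be sketched in
Python roughly as follows. This is only an illustration of the proposed
wording, not a reference implementation; the name encode_for_uri and the use
of None for the empty sequence are assumptions made for the sketch.

```python
# Characters identified as "unreserved" by RFC 3986.
UNRESERVED = frozenset(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789-._~"
)

def encode_for_uri(uri_part):
    """Percent-encode every octet that is not an unreserved character."""
    if uri_part is None:  # stands in for the empty sequence
        return ""
    out = []
    # Work on the UTF-8 octets so non-ASCII characters become %XX sequences.
    for byte in uri_part.encode("utf-8"):
        ch = chr(byte)
        # Upper-case hex digits, as RFC 3986 recommends.
        out.append(ch if ch in UNRESERVED else "%{:02X}".format(byte))
    return "".join(out)

print("http://www.example.com/" + encode_for_uri("100% organic"))
```

Because the sketch escapes everything outside the unreserved set, it also
escapes "#" and every other delimiter when handed a complete URI.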

fn:iri-to-uri

  fn:iri-to-uri($uri-part as xs:string?) as xs:string

Summary: This function converts an xs:string containing an IRI into
a URI according to the rules spelled out in Section 3.1 of [RFC 3987].
It is idempotent but not invertible.

If $uri-part is the empty sequence, returns the zero-length string.

Since [RFC 3986] recommends that, for consistency, URI producers and
normalizers should use uppercase hexadecimal digits for all
percent-encodings, this function must always generate hexadecimal
values using the upper-case letters A-F.

Note:

  Since this function does not escape the PERCENT SIGN &quot;%&quot; and this
  character is not allowed in data within a URI, users wishing to
  convert character strings, such as file names, that include &quot;%&quot; to a
  URI should manually escape &quot;%&quot; by replacing it with &quot;%25&quot;.
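
A rough Python sketch of the percent-encoding step may help make this
concrete. It assumes the ucschar and iprivate ranges from RFC 3987 are the
only characters to be escaped, and it leaves everything else (including "%"
and "#") untouched. The name iri_to_uri and the use of None for the empty
sequence are assumptions of the sketch, and the other steps of Section 3.1
are not attempted.

```python
# ucschar and iprivate ranges from RFC 3987, as inclusive code point pairs.
UCSCHAR = [(0xA0, 0xD7FF), (0xF900, 0xFDCF), (0xFDF0, 0xFFEF)]
# Planes 1 through 13 each contribute xx0000-xxFFFD.
UCSCHAR += [(p * 0x10000, p * 0x10000 + 0xFFFD) for p in range(1, 14)]
UCSCHAR.append((0xE1000, 0xEFFFD))
IPRIVATE = [(0xE000, 0xF8FF), (0xF0000, 0xFFFFD), (0x100000, 0x10FFFD)]
ESCAPED_RANGES = [range(lo, hi + 1) for lo, hi in UCSCHAR + IPRIVATE]

def iri_to_uri(iri):
    """Percent-encode the UTF-8 octets of ucschar and iprivate characters."""
    if iri is None:  # stands in for the empty sequence
        return ""
    out = []
    for ch in iri:
        if any(ord(ch) in r for r in ESCAPED_RANGES):
            # Upper-case hex digits, as RFC 3986 recommends.
            out.extend("%{:02X}".format(b) for b in ch.encode("utf-8"))
        else:
            out.append(ch)
    return "".join(out)

print(iri_to_uri("http://www.example.com/~b\u00e9b\u00e9"))
```

Since the output contains only ASCII and "%" is passed through, applying the
sketch a second time changes nothing, which is the idempotence claimed in the
summary.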
</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>8042</commentid>
    <comment_count>3</comment_count>
    <who name="Ashok Malhotra">ashok.malhotra</who>
    <bug_when>2006-01-30 16:30:04 +0000</bug_when>
    <thetext>As decided by the joint WGs, changed the description of fn:encode-for-uri and
fn:iri-to-uri based on the wording supplied by Norman Walsh.</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>9201</commentid>
    <comment_count>4</comment_count>
    <who name="Michael Kay">mike</who>
    <bug_when>2006-04-13 09:00:22 +0000</bug_when>
    <thetext>Norm Walsh&apos;s proposal in comment #2 includes the example:

    * fn:encode-for-uri(&quot;http://www.example.com/00/Weather/CA/Los%20Angeles#ocean&quot;) 
      returns 
&quot;http%3A%2F%2Fwww.example.com%2F00%2FWeather%2FCA%2FLos%2520Angeles#ocean&quot;.
      This is probably not what the user intended because all of the delimiters
      have been encoded.

which rather contradicts the intent, clearly stated in comment #1:

&quot;Escaping the # seems like the right thing&quot;

I think this is just an editorial error in the proposed example, rather than anything deeper.

Pointed out on the Saxon list by Kevin Rodgers: https://sourceforge.net/forum/message.php?msg_id=3683924</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>9302</commentid>
    <comment_count>5</comment_count>
    <who name="Ashok Malhotra">ashok.malhotra</who>
    <bug_when>2006-04-18 22:26:43 +0000</bug_when>
    <thetext>On the 2006 April 18 telcon the joint WG agreed to correct the first example in fn:encode-for-uri by escaping the # mark.</thetext>
  </long_desc>
      
      

    </bug>

</bugzilla>