2457 – Rules for URI encoding don't match RFC 3986/3987

This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019.

Status:	CLOSED FIXED

Alias:	None

Product:	XPath / XQuery / XSLT
Classification:	Unclassified
Component:	Functions and Operators 1.0
Version:	Candidate Recommendation
Hardware:	PC Windows XP

Importance:	P2 normal
Target Milestone:	---
Assignee:	Ashok Malhotra
QA Contact:	Mailing list for public feedback on specs from XSL and XML Query WGs

Reported:	2005-11-04 16:39 UTC by Michael Kay
Modified:	2006-11-16 18:48 UTC
CC List:	0 users

Description Michael Kay 2005-11-04 16:39:20 UTC

I hate bringing up this old chestnut again, but I have a nasty feeling we've
got it wrong.

Currently encode-for-uri() does NOT escape a "#" sign.

This seems contrary to the purpose of the function, and inconsistent with
the treatment of other characters.

In RFC 3986 (2.2 reserved characters), we read:

      reserved    = gen-delims / sub-delims

      gen-delims  = ":" / "/" / "?" / "#" / "[" / "]" / "@"

      sub-delims  = "!" / "$" / "&" / "'" / "(" / ")"
                  / "*" / "+" / "," / ";" / "="

The spec goes on to say:

URI producing applications should percent-encode data octets that
   correspond to characters in the reserved set unless these characters
   are specifically allowed by the URI scheme to represent data in that
   component. [This basically means that sub-delims are delimiters in some
   URI schemes/contexts, and not in others.]

encode-for-uri() escapes all characters except A-Z, a-z, 0-9, and

      "#" "-" "_" "." "!" "~" "*" "'" "(" ")"

This seems to come largely from RFC2396, which has (in section 2.2)

unreserved  = alphanum | mark

mark        = "-" | "_" | "." | "!" | "~" | "*" | "'" | "(" | ")"

the only difference being the "#".

The concept of "mark" seems to have disappeared in 3986.

RFC 2396 then says (2.4):

Data must be escaped if it does not have a representation using an
   unreserved character

So both RFCs agree that "#", if it is not used with its special purpose as a
delimiter, must be escaped.

So why don't we escape it?

The history of this is so tortuous that I really don't want to research it.
I think a lot of it has to do with the fact that RFC 2396 handled it badly.
3986 seems much clearer, and my recommendation would be that we not only add
"#" to the list of characters that are escaped, but that we do exactly what
3986 says, which is to escape all characters in the "reserved" list (both
gen-delims and sub-delims) above.
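
To make the size of that change concrete, here is a small Python sketch that compares the characters the current draft leaves unescaped (the RFC 2396 "mark" set plus "#") with the RFC 3986 "reserved" set quoted above. The variable names are illustrative only; they do not come from either RFC or from the F&O spec.

```python
# Characters the current encode-for-uri() leaves unescaped,
# in addition to A-Z, a-z and 0-9 (RFC 2396 "mark" plus "#").
RFC2396_MARK = set("-_.!~*'()")
CURRENT_EXEMPT = RFC2396_MARK | {"#"}

# RFC 3986, section 2.2.
GEN_DELIMS = set(":/?#[]@")
SUB_DELIMS = set("!$&'()*+,;=")
RESERVED = GEN_DELIMS | SUB_DELIMS

# Currently exempt characters that would start being escaped
# if everything in "reserved" were escaped.
newly_escaped = sorted(CURRENT_EXEMPT & RESERVED)

# Currently exempt characters that would stay exempt.
still_exempt = sorted(CURRENT_EXEMPT - RESERVED)

print(newly_escaped)  # ['!', '#', "'", '(', ')', '*']
print(still_exempt)   # ['-', '.', '_', '~']
```

Under these assumptions six characters move into the escaped set, and the four that remain exempt are exactly the non-alphanumeric characters that RFC 3986 lists as "unreserved".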

Procedurally, as RFC 3986 is dated January 2005, I think we can reasonably
argue that it was an oversight not to bring our specs into line with it for
the last call, and that it's reasonable to rectify the situation during CR.
Other WGs have been fairly interested in this question so we'll obviously
need to consult.

Note: I was alerted to the oddity of the current spec by the test results
for fn-encode-for-uri1args-1 and related tests. The Saxon implementation
currently does escape "#".

Having looked at this, we should then look at the iri-to-uri() list as well.
It's hard to see any relationship between that list of characters and
RFC3986 either. In fact, the statement:

All characters are escaped other than the lower case letters a-z, the upper
case letters A-Z, the digits 0-9, the NUMBER SIGN "#" and HYPHEN-MINUS
("-"), LOW LINE ("_"), FULL STOP ".", EXCLAMATION MARK "!", TILDE "~",
ASTERISK "*", APOSTROPHE "'", LEFT PARENTHESIS "(", and RIGHT PARENTHESIS
")", SEMICOLON ";", SOLIDUS "/", QUESTION MARK "?", COLON ":", COMMERCIAL AT
"@", AMPERSAND "&", EQUALS SIGN "=", PLUS SIGN "+", DOLLAR SIGN "$", COMMA
",", LEFT SQUARE BRACKET "[", RIGHT SQUARE BRACKET "]", and the PERCENT SIGN
"%".

seems equivalent to saying: escape all non-ASCII characters, plus the ASCII
characters ", <, >, `, \, ^, and |. That is a pretty bizarre list.
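
One way to check which ASCII characters the quoted statement really escapes is to take the complement of its exempt list over printable ASCII. The following Python sketch does that; the names are illustrative, and it deliberately ignores control characters and non-ASCII characters.

```python
import string

# Characters the quoted iri-to-uri() statement leaves unescaped.
EXEMPT = set(string.ascii_letters + string.digits + "#-_.!~*'();/?:@&=+$,[]%")

# Printable ASCII runs from SPACE (0x20) to TILDE (0x7E).
printable_ascii = [chr(cp) for cp in range(0x20, 0x7F)]

# Everything printable that is not exempt gets escaped.
escaped = [ch for ch in printable_ascii if ch not in EXEMPT]

print(escaped)
```

Run mechanically, the complement contains the characters listed in the text and, in addition, the space character and the braces "{" and "}".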

We would expect to find the spec for iri-to-uri() in RFC3987, and sure
enough, it's there. What it says is that every character in "ucschar" or
"iprivate" must be %-encoded. That's defined like this:

ucschar        = %xA0-D7FF / %xF900-FDCF / %xFDF0-FFEF
                  / %x10000-1FFFD / %x20000-2FFFD / %x30000-3FFFD
                  / %x40000-4FFFD / %x50000-5FFFD / %x60000-6FFFD
                  / %x70000-7FFFD / %x80000-8FFFD / %x90000-9FFFD
                  / %xA0000-AFFFD / %xB0000-BFFFD / %xC0000-CFFFD
                  / %xD0000-DFFFD / %xE1000-EFFFD

   iprivate       = %xE000-F8FF / %xF0000-FFFFD / %x100000-10FFFD

which is pretty much the same as saying "non-ASCII characters" (and thus
overlaps rather with escape-html-uri()).

Since we now have a function called iri-to-uri(), it would seem that it
ought to do what the IRI spec says.
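
As a rough illustration of what "doing what the IRI spec says" could look like, here is a Python sketch of only the percent-encoding step: each character in "ucschar" or "iprivate" is converted to UTF-8 and each resulting octet is written as %XX. The function name and the range tables are assumptions made for this example, and any other processing RFC 3987 describes (for instance for host names) is left out.

```python
# ucschar ranges from RFC 3987: three BMP ranges, planes 1-13
# (each ending at xFFFD), and %xE1000-EFFFD.
UCSCHAR = (
    [(0xA0, 0xD7FF), (0xF900, 0xFDCF), (0xFDF0, 0xFFEF)]
    + [(plane << 16, (plane << 16) | 0xFFFD) for plane in range(0x1, 0xE)]
    + [(0xE1000, 0xEFFFD)]
)
IPRIVATE = [(0xE000, 0xF8FF), (0xF0000, 0xFFFFD), (0x100000, 0x10FFFD)]


def needs_escaping(ch):
    """True if ch falls in ucschar or iprivate."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in UCSCHAR + IPRIVATE)


def iri_to_uri_sketch(iri):
    """Percent-encode (UTF-8, upper-case hex) only ucschar/iprivate characters."""
    out = []
    for ch in iri:
        if needs_escaping(ch):
            out.append("".join("%{:02X}".format(b) for b in ch.encode("utf-8")))
        else:
            out.append(ch)
    return "".join(out)


print(iri_to_uri_sketch("http://www.example.com/~b\u00e9b\u00e9#top"))
# http://www.example.com/~b%C3%A9b%C3%A9#top
```

Note that ASCII characters, including "#", "%" and the space, pass through this step untouched, so applying it twice gives the same result as applying it once.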

Previously raised internally at 
http://lists.w3.org/Archives/Member/w3c-xsl-query/2005Oct/0044.html

See also subsequent thread.

Comment 1 Norman Walsh 2005-12-13 18:45:07 UTC

Escaping the # seems like the right thing; see
http://lists.w3.org/Archives/Public/www-tag/2005Dec/0040

Comment 2 Norman Walsh 2006-01-17 15:57:06 UTC

My proposal per ACTION A-282-01

fn:encode-for-uri

  fn:encode-for-uri($uri-part as xs:string?) as xs:string

Summary: This function encodes reserved characters in an xs:string
that is intended to be used in the path segment of a URI. It is
invertible but not idempotent. This function applies the URI escaping
rules defined in section 2 of [RFC 3986] to the string supplied as
$uri-part. The effect of the function is to escape reserved
characters. Each such character in the string is replaced with its
percent-encoded form as described in [RFC 3986].

If $uri-part is the empty sequence, returns the zero-length string.

All characters are escaped except those identified as "unreserved" by
[RFC 3986], that is, the upper- and lower-case letters A-Z, the digits
0-9, HYPHEN-MINUS ("-"), LOW LINE ("_"), FULL STOP ("."), and TILDE ("~").

Note that this function escapes URI delimiters and therefore cannot be
used indiscriminately to encode "invalid" characters in a path
segment.

Since [RFC 3986] recommends that, for consistency, URI producers and
normalizers should use uppercase hexadecimal digits for all
percent-encodings, this function must always generate hexadecimal
values using the upper-case letters A-F.

Examples

    * fn:encode-for-uri("http://www.example.com/00/Weather/CA/Los%20Angeles#ocean")
      returns
      "http%3A%2F%2Fwww.example.com%2F00%2FWeather%2FCA%2FLos%2520Angeles#ocean".
      This is probably not what the user intended because all of the delimiters
      have been encoded.

    * concat("http://www.example.com/", encode-for-uri("~bébé"))
      returns "http://www.example.com/~b%C3%A9b%C3%A9".

    * concat("http://www.example.com/", encode-for-uri("100% organic"))
      returns "http://www.example.com/100%25%20organic".
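
For readers who want to experiment with the rule as worded above, here is a minimal Python sketch: every UTF-8 octet that is not an RFC 3986 "unreserved" character is written as %XX with upper-case hex digits. The function name is hypothetical, and this is an illustration of the wording, not a reference implementation.

```python
import string

# RFC 3986 "unreserved": ALPHA / DIGIT / "-" / "." / "_" / "~"
UNRESERVED = frozenset((string.ascii_letters + string.digits + "-._~").encode("ascii"))


def encode_for_uri(uri_part):
    """Percent-encode everything except RFC 3986 unreserved characters."""
    if uri_part is None:  # stands in for the empty sequence
        return ""
    out = []
    for octet in uri_part.encode("utf-8"):
        if octet in UNRESERVED:
            out.append(chr(octet))
        else:
            out.append("%{:02X}".format(octet))  # upper-case hex digits
    return "".join(out)


print("http://www.example.com/" + encode_for_uri("~b\u00e9b\u00e9"))
# http://www.example.com/~b%C3%A9b%C3%A9
print("http://www.example.com/" + encode_for_uri("100% organic"))
# http://www.example.com/100%25%20organic
```

Under this reading "#" is not an unreserved character, so encode_for_uri("a#b") gives "a%23b". Applying the function twice also re-escapes the "%" signs produced by the first pass, which is why the function is described as not idempotent.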

fn:iri-to-uri

  fn:iri-to-uri($uri-part as xs:string?) as xs:string

Summary: This function converts an xs:string containing an IRI into
a URI according to the rules spelled out in Section 3.1 of [RFC 3987].
It is idempotent but not invertible.

If $uri-part is the empty sequence, returns the zero-length string.

Since [RFC 3986] recommends that, for consistency, URI producers and
normalizers should use uppercase hexadecimal digits for all
percent-encodings, this function must always generate hexadecimal
values using the upper-case letters A-F.

Note:

  Since this function does not escape the PERCENT SIGN "%" and this
  character is not allowed in data within a URI, users wishing to
  convert character strings, such as file names, that include "%" to a
  URI should manually escape "%" by replacing it with "%25".

Comment 3 Ashok Malhotra 2006-01-30 16:30:04 UTC

As decided by the joint WGs, changed the description of fn:encode-for-uri and
fn:iri-to-uri based on the wording supplied by Norman Walsh.

Comment 4 Michael Kay 2006-04-13 09:00:22 UTC

Norm Walsh's proposal in comment #2 includes the example:

    * fn:encode-for-uri("http://www.example.com/00/Weather/CA/Los%20Angeles#ocean")
      returns
      "http%3A%2F%2Fwww.example.com%2F00%2FWeather%2FCA%2FLos%2520Angeles#ocean".
      This is probably not what the user intended because all of the delimiters
      have been encoded.

which rather contradicts the intent, clearly stated in comment #1:

"Escaping the # seems like the right thing"

I think this is just an editorial error in the proposed example, rather than anything deeper.

Pointed out on the Saxon list by Kevin Rodgers: https://sourceforge.net/forum/message.php?msg_id=3683924

Comment 5 Ashok Malhotra 2006-04-18 22:26:43 UTC

On the 2006 April 18 telcon the joint WG agreed to correct the first example in fn:encode-for-uri by escaping the # mark.