Bug 2457: Rules for URI encoding don't match RFC 3986/3987
Created: 2005-11-04 16:39:20 +0000
Last modified: 2006-11-16 18:48:29 +0000
Classification: Unclassified
Product: XPath / XQuery / XSLT
Component: Functions and Operators 1.0
Version: Candidate Recommendation
Platform: PC, Windows XP
Status: CLOSED FIXED
Priority: P2
Severity: normal
Reporter: mike
Assigned to: ashok.malhotra
QA contact: public-qt-comments

Comment 0 (id 7041), mike, 2005-11-04 16:39:20 +0000

I hate bringing up this old chestnut again, but I have a nasty feeling we've got it wrong.

Currently encode-for-uri() does NOT escape a "#" sign. This seems contrary to the purpose of the function, and inconsistent with the treatment of other characters.

In RFC 3986 (2.2 reserved characters), we read:

    reserved   = gen-delims / sub-delims
    gen-delims = ":" / "/" / "?" / "#" / "[" / "]" / "@"
    sub-delims = "!" / "$" / "&" / "'" / "(" / ")" / "*" / "+" / "," / ";" / "="

The spec goes on to say:

    URI producing applications should percent-encode data octets that correspond to characters in the reserved set unless these characters are specifically allowed by the URI scheme to represent data in that component.

[This basically means that sub-delims are delimiters in some URI schemes/contexts, and not in others.]

encode-for-uri() escapes all characters except A-Z, a-z, 0-9, and "#" "-" "_" "." "!" "~" "*" "'" "(" ")"

This seems to come largely from RFC 2396, which has (in section 2.3)

    unreserved = alphanum | mark
    mark       = "-" | "_" | "." | "!" | "~" | "*" | "'" | "(" | ")"

the only difference being the "#". The concept of "mark" seems to have disappeared in 3986.

RFC 2396 then says (2.4):

    Data must be escaped if it does not have a representation using an unreserved character

So both RFCs agree that "#", if it is not used with its special purpose as a delimiter, must be escaped.

So why don't we escape it? The history of this is so tortuous that I really don't want to research it. I think a lot of it has to do with the fact that RFC 2396 handled it badly.
RFC 3986 seems much clearer, and my recommendation would be that we not only add "#" to the list of characters that are escaped, but that we do exactly what 3986 says, which is to escape all characters in the "reserved" list (both gen-delims and sub-delims) above.

Procedurally, as RFC 3986 is dated January 2005, I think we can reasonably argue that it was an oversight not to bring our specs into line with it for the last call, and that it's reasonable to rectify the situation during CR. Other WGs have been fairly interested in this question so we'll obviously need to consult.

Note: I was alerted to the oddity of the current spec by the test results for fn-encode-for-uri1args-1 and related tests. The Saxon implementation currently does escape "#".

Having looked at this, we should then look at the iri-to-uri() list as well. It's hard to see any relationship between that list of characters and RFC 3986 either. In fact, the statement:

    All characters are escaped other than the lower case letters a-z, the upper case letters A-Z, the digits 0-9, the NUMBER SIGN "#" and HYPHEN-MINUS ("-"), LOW LINE ("_"), FULL STOP ".", EXCLAMATION MARK "!", TILDE "~", ASTERISK "*", APOSTROPHE "'", LEFT PARENTHESIS "(", and RIGHT PARENTHESIS ")", SEMICOLON ";", SOLIDUS "/", QUESTION MARK "?", COLON ":", COMMERCIAL AT "@", AMPERSAND "&", EQUALS SIGN "=", PLUS SIGN "+", DOLLAR SIGN "$", COMMA ",", LEFT SQUARE BRACKET "[", RIGHT SQUARE BRACKET "]", and the PERCENT SIGN "%".

seems equivalent to saying "escape all non-ASCII characters plus (", <, >, `, \, ^, and |)" - which is a pretty bizarre list.

We would expect to find the spec for iri-to-uri() in RFC 3987, and sure enough, it's there. What it says is that every character in "ucschar" or "iprivate" must be %-encoded.
That's defined like this:

    ucschar  = %xA0-D7FF / %xF900-FDCF / %xFDF0-FFEF
             / %x10000-1FFFD / %x20000-2FFFD / %x30000-3FFFD
             / %x40000-4FFFD / %x50000-5FFFD / %x60000-6FFFD
             / %x70000-7FFFD / %x80000-8FFFD / %x90000-9FFFD
             / %xA0000-AFFFD / %xB0000-BFFFD / %xC0000-CFFFD
             / %xD0000-DFFFD / %xE1000-EFFFD
    iprivate = %xE000-F8FF / %xF0000-FFFFD / %x100000-10FFFD

which is pretty much the same as saying "non-ASCII characters" (and thus overlaps rather with escape-html-uri()).

Since we now have a function called iri-to-uri(), it would seem that it ought to do what the IRI spec says.

Previously raised internally at http://lists.w3.org/Archives/Member/w3c-xsl-query/2005Oct/0044.html
See also subsequent thread.

Comment 1 (id 7425), Norman.Walsh, 2005-12-13 18:45:07 +0000

Escaping the # seems like the right thing; see http://lists.w3.org/Archives/Public/www-tag/2005Dec/0040

Comment 2 (id 7856), Norman.Walsh, 2006-01-17 15:57:06 +0000

My proposal per ACTION A-282-01

fn:encode-for-uri

    fn:encode-for-uri($uri-part as xs:string?) as xs:string

Summary: This function encodes reserved characters in an xs:string that is intended to be used in the path segment of a URI. It is invertible but not idempotent. This function applies the URI escaping rules defined in section 2 of [RFC 3986] to the string supplied as $uri-part. The effect of the function is to escape reserved characters. Each such character in the string is replaced with its percent-encoded form as described in [RFC 3986].

If $uri-part is the empty sequence, returns the zero-length string.

All characters are escaped except those identified as "unreserved" by [RFC 3986], that is the upper- and lower-case letters A-Z, the digits 0-9, HYPHEN-MINUS ("-"), LOW LINE ("_"), FULL STOP ".", and TILDE "~".

Note that this function escapes URI delimiters and therefore cannot be used indiscriminately to encode "invalid" characters in a path segment.
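As an illustration only (it is not part of the proposal), the escaping rule described above can be sketched in Python. The names encode_for_uri and UNRESERVED are hypothetical. The sketch assumes each escaped character is first converted to its UTF-8 octets, and it emits uppercase hexadecimal digits, as RFC 3986 recommends.

```python
# Characters identified as "unreserved" by RFC 3986: letters, digits, "-", "_", ".", "~".
UNRESERVED = frozenset(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789"
    "-_.~"
)


def encode_for_uri(uri_part):
    """Percent-encode every character of uri_part that is not unreserved.

    None stands in for the empty sequence and yields the zero-length string.
    """
    if uri_part is None:
        return ""
    out = []
    for ch in uri_part:
        if ch in UNRESERVED:
            out.append(ch)
        else:
            # Assumption: escape the UTF-8 octets of the character, uppercase hex.
            out.extend("%{:02X}".format(octet) for octet in ch.encode("utf-8"))
    return "".join(out)


if __name__ == "__main__":
    print(encode_for_uri("100% organic"))  # 100%25%20organic
```

Because "%" itself is escaped, applying the function twice escapes the result again, which is why the rule is invertible but not idempotent. Under this rule "#" is not unreserved and so comes out as "%23".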
Since [RFC 3986] recommends that, for consistency, URI producers and normalizers should use uppercase hexadecimal digits for all percent-encodings, this function must always generate hexadecimal values using the upper-case letters A-F.

Examples

* fn:encode-for-uri("http://www.example.com/00/Weather/CA/Los%20Angeles#ocean") returns "http%3A%2F%2Fwww.example.com%2F00%2FWeather%2FCA%2FLos%2520Angeles#ocean". This is probably not what the user intended because all of the delimiters have been encoded.
* concat("http://www.example.com/", encode-for-uri("~bébé")) returns "http://www.example.com/~b%C3%A9b%C3%A9".
* concat("http://www.example.com/", encode-for-uri("100% organic")) returns "http://www.example.com/100%25%20organic".

fn:iri-to-uri

    fn:iri-to-uri($uri-part as xs:string?) as xs:string

Summary: This function converts an xs:string containing an IRI into a URI according to the rules spelled out in Section 3.1 of [RFC 3987]. It is idempotent but not invertible.

If $uri-part is the empty sequence, returns the zero-length string.

Since [RFC 3986] recommends that, for consistency, URI producers and normalizers should use uppercase hexadecimal digits for all percent-encodings, this function must always generate hexadecimal values using the upper-case letters A-F.

Note: Since this function does not escape the PERCENT SIGN "%" and this character is not allowed in data within a URI, users wishing to convert character strings, such as file names, that include "%" to a URI should manually escape "%" by replacing it with "%25".

Comment 3 (id 8042), ashok.malhotra, 2006-01-30 16:30:04 +0000

As decided by the joint WGs, changed the description of fn:encode-for-uri and fn:iri-to-uri based on the wording supplied by Norman Walsh.

Comment 4 (id 9201), mike, 2006-04-13 09:00:22 +0000

Norm Walsh's proposal in comment #2 includes the example:

* fn:encode-for-uri("http://www.example.com/00/Weather/CA/Los%20Angeles#ocean") returns "http%3A%2F%2Fwww.example.com%2F00%2FWeather%2FCA%2FLos%2520Angeles#ocean".
This is probably not what the user intended because all of the delimiters have been encoded.

which rather contradicts the intent, clearly stated in comment #1:

"Escaping the # seems like the right thing"

I think this is just an editorial error in the proposed example, rather than anything deeper.

Pointed out on the Saxon list by Kevin Rodgers: https://sourceforge.net/forum/message.php?msg_id=3683924

Comment 5 (id 9302), ashok.malhotra, 2006-04-18 22:26:43 +0000

On the 2006 April 18 telcon the joint WG agreed to correct the first example in fn:encode-for-uri by escaping the # mark.
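For the fn:iri-to-uri half of this bug, the RFC 3987 rule quoted in comment #0 (percent-encode every character in "ucschar" or "iprivate") can be sketched in Python as follows. This is a rough illustration, not the wording adopted by the WG: the names iri_to_uri, UCSCHAR, and IPRIVATE are hypothetical, the sketch assumes UTF-8 octets and uppercase hex digits, and it covers only those two character classes.

```python
# Code point ranges copied from the RFC 3987 ABNF quoted in comment #0.
UCSCHAR = (
    [(0xA0, 0xD7FF), (0xF900, 0xFDCF), (0xFDF0, 0xFFEF)]
    # Planes 1 through 0xD: %x10000-1FFFD ... %xD0000-DFFFD
    + [(plane << 16, (plane << 16) | 0xFFFD) for plane in range(0x1, 0xE)]
    + [(0xE1000, 0xEFFFD)]
)
IPRIVATE = [(0xE000, 0xF8FF), (0xF0000, 0xFFFFD), (0x100000, 0x10FFFD)]
ESCAPED_RANGES = UCSCHAR + IPRIVATE


def iri_to_uri(iri):
    """Percent-encode only the characters in ucschar or iprivate.

    None stands in for the empty sequence and yields the zero-length string.
    ASCII characters, including "%" and "#", pass through unchanged.
    """
    if iri is None:
        return ""
    out = []
    for ch in iri:
        code_point = ord(ch)
        if any(low <= code_point <= high for low, high in ESCAPED_RANGES):
            # Assumption: escape the UTF-8 octets of the character, uppercase hex.
            out.extend("%{:02X}".format(octet) for octet in ch.encode("utf-8"))
        else:
            out.append(ch)
    return "".join(out)


if __name__ == "__main__":
    once = iri_to_uri("http://www.example.com/~b\u00e9b\u00e9")
    print(once)
    print(iri_to_uri(once) == once)  # a second pass changes nothing
```

Since "%" is left alone, a second application has nothing left to escape, which matches the "idempotent but not invertible" description in comment #2.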