This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019. Please see the home page for more details.
The TAG today 2005-06-16 resolved as follows. The design of escape-uri has a flaw in that it hides within one function two quite different ones. It should be split into two functions corresponding to different values of the escape-reserved flag. Possible names are as follows: encode-for-uri() takes any Unicode string and returns a string which can be used as a path segment in a URI. This function is invertible, and NOT idempotent. (Definition of the inverse function would clearly be a good idea.) Its semantics are those of your function with the second argument set to TRUE. clean-uri() takes a Unicode string which may contain URI syntax but (like e.g. an IRI) contains invalid URI characters. Without disturbing the URI punctuation, it encodes non-URI characters so that the result is a valid [part of a] URI in ASCII. Its semantics are those of your function with the second argument set to FALSE. It is idempotent and NOT invertible.
I'm inclined to agree. Experience of using this function suggests it's very hard to remember which way to set the boolean argument, and the resulting code is not clear to the reader. I think we were over-influenced by pressure to minimize the number of functions. Perhaps suitable names might be escape-uri() and escape-uri-part(). I haven't seen use cases for an unescape-uri() function, but I agree there's an argument for it based on completeness. Michael Kay (personal response)
Tim, semantically, your design is slightly cleaner. However, it would mean that you have to traverse the string twice instead of once if both transformations are required. So the question becomes which of the transformations is more common. Michael Rys (personal response)
Michael (Kay), the two functions are *quite different* as I understand it. It is not that one operates on part and the other on a whole URI. You can feed a whole or part URI to either. encode-for-uri(s) takes ANY STRING (not necessarily any relation to a URI) and encodes it as something which can be transferred as a path segment. It is an encoding in that there is a corresponding decode. If you use it twice, then you get something double-encoded. Example: use when encoding a string argument to an HTML-form-style query. clean(s) takes a URI (or part) and just cleans it up so that any unacceptable characters are encoded in ASCII. It doesn't encode anything which is already encoded. There is no inverse function, as you can't tell which characters were not clean in the original string. If you use it twice, it's the same as using it once. Example: use when encoding an IRI for transmission in HTTP. Why would you want to perform both operations? The result of encode-for-uri will always be clean, so performing a clean() will have no effect. The result of cleaning a URI will be a clean URI which one may then want to encode as a URI-encoded parameter within a new query URI being built up. But that is a separate function, and should be programmed as such.
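The two properties claimed above (invertible but not idempotent, versus idempotent but not invertible) can be checked with a small Python sketch. The helper names encode_for_uri and clean_uri are hypothetical and only mirror the names proposed in this thread. The choice of which characters to leave untouched is an assumption built on urllib.parse.quote, not a definition of either function.

```python
from urllib.parse import quote, unquote

# Assumption: characters treated as "URI punctuation" by the cleaning
# operation. "%" is included so existing escapes are not re-encoded.
URI_PUNCTUATION = "#-_.!~*'();/?:@&=+$,[]%"


def encode_for_uri(s: str) -> str:
    """Hypothetical sketch: percent-encode everything except letters,
    digits and a few unreserved marks, using UTF-8 octets."""
    return quote(s, safe="")


def clean_uri(s: str) -> str:
    """Hypothetical sketch: percent-encode only characters that are not
    allowed in a URI, leaving punctuation and "%" alone."""
    return quote(s, safe=URI_PUNCTUATION)


original = "100% café/bar"
once = encode_for_uri(original)
twice = encode_for_uri(once)

# Invertible: decoding recovers the input exactly.
round_trip = unquote(once)

# Not idempotent: a second pass double-encodes the "%" signs.
double_encoded = twice != once

iri = "http://example.org/café menu"
cleaned = clean_uri(iri)

# Idempotent: cleaning an already clean string changes nothing.
cleaned_again = clean_uri(cleaned)

# Not invertible: two different inputs give the same output.
collision = clean_uri("a b") == clean_uri("a%20b")
```

Here once is "100%25%20caf%C3%A9%2Fbar", while cleaned is "http://example.org/caf%C3%A9%20menu" with the scheme and slashes left intact.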
TimBL>Michael (Kay), the two functions are *quite different* as I understand it. It is not that one operates on part and the other on a whole URI. You can feed a whole or part URI to either. MHK>I don't think there is any disagreement that the two operations are different, or about the definition of the two operations, or about the reasons why we need to provide both. The question is how to package the two operations to maximize ease of use. That's why I suggested names based on the recommended use cases for the two functions. One of them is there to allow you to produce a URI from a wannabe-URI (I wish we had a better name for the thing), the other is there to enable you to produce a component of a URI from an arbitrary string. We've always had to recognize that the name of a function can't encapsulate the entire semantics of what the function does; the main aim is to choose names that users will find easy to remember and distinguish. From that perspective, I don't think that "clean" is a good name, because it doesn't even hint that the function has anything to do with URIs, and it's quite unrelated to the terminology of the RFCs that describe the operation in more detail. "encode-for-uri" is a more reasonable suggestion, since it's related to the term "percent-encoding" used in RFC 3986 (replacing "escaping" in RFC 2396). But the verb "escape" to describe this operation is well-entrenched in other W3C specifications (XSLT 1.0, HTML, XLink) and therefore in the consciousness of the user community, while the verb "encode" reminds one of the unfortunate history in which the result of this operation at one time depended on the character encoding of the containing document. I don't think one can argue that "encode" is a better name for the operation because it's reversible: most escape conventions are reversible too. If we're going to insert a preposition to emphasize that it's the output that's a URI, not the input, then "as" would be a better choice than "for".
TimBL:>Why would you want to perform both operations? MHK>You wouldn't want to do so, I didn't intend to suggest that you would. Michael Kay
On the joint telcon on 6/28/2005 the WGs agreed to remove the fn:escape-uri function and replace it with 2 functions called fn:encode-for-uri and fn:iri-to-uri corresponding to the behaviour of fn:escape-uri with the parameter escape-reserved set to TRUE and FALSE respectively. I would appreciate interesting examples to include in the description of these functions. Ashok Malhotra
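As a possible starting point for such examples, here is a Python sketch of the two use cases mentioned earlier in this thread: encoding a string argument for a form-style query, and preparing an IRI for transmission in HTTP. The functions encode_for_uri and iri_to_uri are hypothetical Python stand-ins for fn:encode-for-uri and fn:iri-to-uri. The set of characters they leave unescaped is an assumption and should be checked against the final function definitions. The example.com URIs are illustrative only.

```python
from urllib.parse import quote


def encode_for_uri(s: str) -> str:
    # Stand-in for escape-reserved = TRUE: reserved characters such as
    # "&", "/" and "%" are escaped along with everything non-ASCII.
    return quote(s, safe="")


def iri_to_uri(s: str) -> str:
    # Stand-in for escape-reserved = FALSE: URI punctuation and existing
    # "%" escapes are kept (assumed character set).
    return quote(s, safe="#-_.!~*'();/?:@&=+$,[]%")


# Use case 1: an arbitrary string becomes one query parameter value.
# The "&" must be escaped or it would start a second parameter.
search = "http://www.example.com/search?q=" + encode_for_uri("fish & chips")

# Use case 2: an IRI becomes a URI; only the non-ASCII letters change.
uri = iri_to_uri("http://www.example.com/bébé")

# A cleaned URI embedded as a parameter of a new query URI is a separate,
# second step, and it deliberately escapes the "%" signs again.
nested = "http://www.example.com/fetch?target=" + encode_for_uri(uri)
```

With these assumptions, search ends in "q=fish%20%26%20chips" and uri is "http://www.example.com/b%C3%A9b%C3%A9".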