1502 – [F&O] escape-uri encompasses & s/b split into 2 distinct functions

This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019.

Status:	CLOSED FIXED

Alias:	None

Product:	XPath / XQuery / XSLT
Classification:	Unclassified
Component:	Functions and Operators 1.0
Version:	Last Call drafts
Hardware:	Macintosh All

Importance:	P2 normal
Target Milestone:	---
Assignee:	Ashok Malhotra
QA Contact:	Mailing list for public feedback on specs from XSL and XML Query WGs

URL:	http://www.w3.org/TR/2005/WD-xpath-fu...
Whiteboard:
Keywords:

Depends on:
Blocks:

Reported:	2005-06-16 18:09 UTC by Tim Berners-Lee
Modified:	2005-09-29 11:29 UTC
CC List:	0 users

See Also:

Description Tim Berners-Lee 2005-06-16 18:09:49 UTC

The TAG today 2005-06-16 resolved as follows.

The design of escape-uri has a flaw in that it hides within one function two quite different ones.  It 
should be split into two functions corresponding to different values of the escape-reserved flag. 
Possible names are as follows:

encode-for-uri()  takes any Unicode string and returns a string which can be used as a path segment in 
a URI.  This function is invertible, and NOT idempotent.  (Definition of the inverse function would 
clearly be a good idea).  Its semantics are those of your function with the second argument set to TRUE.

clean-uri()  takes a Unicode string which may contain URI syntax but (like e.g. IRI) contains invalid URI 
characters. Without disturbing the URI punctuation, it encodes non-URI characters so that the result is a 
valid [part of a] URI in ASCII.  Its semantics are those of your function with the second argument set to 
FALSE.  It is idempotent and NOT invertible.
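
As a rough illustration of the two behaviours described above, here is a sketch in Python. It is not the XPath functions themselves: urllib.parse.quote stands in for the escaping, the names encode_for_uri and clean_uri simply mirror the proposed names, and the set of punctuation left untouched (URI_PUNCTUATION) is an assumption rather than the character set of any draft.

```python
from urllib.parse import quote, unquote

# Assumption: punctuation that the "clean" operation leaves alone.
# It includes "%", so existing percent-escapes are not escaped again.
URI_PUNCTUATION = "!#$%&'()*+,/:;=?@[]"


def encode_for_uri(s: str) -> str:
    """Escape everything except letters, digits and "-", "_", ".", "~"."""
    return quote(s, safe="")


def clean_uri(s: str) -> str:
    """Escape only characters that cannot appear in a URI at all."""
    return quote(s, safe=URI_PUNCTUATION)


# encode_for_uri: invertible, but applying it twice double-encodes.
segment = "a/b c?d"
encoded = encode_for_uri(segment)          # "a%2Fb%20c%3Fd"
decoded = unquote(encoded)                 # back to "a/b c?d"
double_encoded = encode_for_uri(encoded)   # every "%" becomes "%25"

# clean_uri: idempotent, but two different inputs can give the same output.
iri = "http://example.org/caf\u00e9 menu?x=1"
cleaned = clean_uri(iri)                   # "http://example.org/caf%C3%A9%20menu?x=1"
cleaned_twice = clean_uri(cleaned)         # unchanged
```

Here clean_uri("caf\u00e9") and clean_uri("caf%C3%A9") both yield "caf%C3%A9", which is why no inverse can exist for the second operation.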

Comment 1 Michael Kay 2005-06-16 18:25:55 UTC

I'm inclined to agree. Experience of using this function suggests it's very hard
to remember which way to set the boolean argument, and the resulting code is not
clear to the reader. I think we were over-influenced by pressure to minimize the
number of functions.

Perhaps suitable names might be escape-uri() and escape-uri-part().

I haven't seen use cases for an unescape-uri() function, but I agree there's an
argument for it based on completeness. 

Michael Kay (personal response)

Comment 2 Michael Rys 2005-06-16 19:17:44 UTC

Tim

Semantically, your design is slightly cleaner. However it would mean that you 
have to traverse the string twice instead of once if both transformations are 
required.

So the question becomes which of the transformations are more common.

Michael Rys (personal response)

Comment 3 Tim Berners-Lee 2005-06-17 02:05:04 UTC

Michael (Kay), the two functions are *quite different* as I understand it.  It is not that one operates on 
part and the other on a whole URI.  You can feed a whole or part URI to either.

encode-for-uri(s) takes ANY STRING (not necessarily any relation to a URI) and encodes it as 
something which can be transferred as a path segment.  It is an encoding in that there is a corresponding 
decode.  If you use it twice, then you get something double-encoded. Example: Use when encoding a 
string argument to an HTML-form-style query.

clean(s) takes a URI (or part) and just cleans it up so that any unacceptable characters are encoded in 
ASCII.  It doesn't encode anything which is already encoded. There is no inverse function, as you can't 
tell what characters were not originally clean in the original string.  If you use it twice, it's the same as 
using it once.  Example:  Use when encoding an IRI for transmission in HTTP.

Why would you want to perform both operations?  The result of encode-for-uri will always be clean so 
performing a clean() will have no effect.  The result of cleaning a URI will be a clean URI which one may 
want to then encode as a URI encoded parameter within a new query URI being built up. But that is a 
separate function, and should be programmed as such.
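
The composition described in the last paragraph can be sketched in Python as follows. This is only an illustration: urllib.parse.quote stands in for the escaping, the two function names mirror the ones used in this comment, the safe-character set passed to clean is an assumption, and the example.org / example.com URIs and the "target" parameter name are made up.

```python
from urllib.parse import quote


def encode_for_uri(s: str) -> str:
    # Escapes everything except letters, digits and "-", "_", ".", "~".
    return quote(s, safe="")


def clean(s: str) -> str:
    # Assumed safe set: URI punctuation plus "%", so existing escapes survive.
    return quote(s, safe="!#$%&'()*+,/:;=?@[]")


iri = "http://example.org/r\u00e9sum\u00e9"

# Cleaning the output of encode_for_uri changes nothing: it is already clean.
already_clean = encode_for_uri(iri)
same_again = clean(already_clean)

# Cleaning first, then embedding the clean URI as a query parameter,
# is a separate step with its own (deliberate) second level of escaping.
cleaned = clean(iri)
query_uri = "http://example.com/fetch?target=" + encode_for_uri(cleaned)
```

The second level of escaping in query_uri is intentional: the receiver decodes the parameter once and recovers the clean URI, not the original IRI.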

Comment 4 Michael Kay 2005-06-17 03:55:47 UTC

TimBL>Michael (Kay), the two functions are *quite different* as I understand it.
It is not that one operates on part and the other on a whole URI.  You can feed
a whole or part URI to either.

MHK>I don't think there is any disagreement that the two operations are
different, or about the definition of the two operations, or about the reasons
why we need to provide both. The question is how to package the two operations
to maximize ease of use. 

That's why I suggested names based on the recommended use cases for the two
functions. One of them is there to allow you to produce a URI from a wannabe-URI
(I wish we had a better name for the thing), the other is there to enable you to
produce a component of a URI from an arbitrary string. 

We've always had to recognize that the name of a function can't encapsulate the
entire semantics of what the function does; the main aim is to choose names that
users will find easy to remember and distinguish. 

From that perspective, I don't think that "clean" is a good name, because it
doesn't even hint that the function has anything to do with URIs, and it's quite
unrelated to the terminology of the RFCs that describe the operation in more
detail. 

"encode-for-uri" is a more reasonable suggestion, since it's related to the term
"percent-encoding" used in RFC 3986 (replacing "escaping" in RFC 2396). But the
verb "escape" to describe this operation is well-entrenched in other W3C
specifications (XSLT 1.0, HTML, XLink) and therefore in the consciousness of the
user community, while the verb "encode" reminds one of the unfortunate history
in which the result of this operation at one time depended on the character
encoding of the containing document. I don't think one can argue that "encode"
is a better name for the operation because it's reversible: most escape
conventions are reversible too.

If we're going to insert a preposition to emphasize that it's the output that's
a URI, not the input, then "as" would be a better choice than "for".

TimBL>Why would you want to perform both operations?

MHK>You wouldn't want to do so, I didn't intend to suggest that you would.

Michael Kay

Comment 5 Ashok Malhotra 2005-06-30 13:09:53 UTC

On the joint telcon on 6/28/2005 the WGs agreed to remove the fn:escape-uri
function and replace it with 2 functions called fn:encode-for-uri and
fn:iri-to-uri corresponding to the behaviour of fn:escape-uri with the parameter
escape-reserved set to TRUE and FALSE respectively.

I would appreciate interesting examples to include in the description of these
functions.

Ashok Malhotra
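
The correspondence agreed in Comment 5 can be pictured as a thin dispatch from the old two-argument form onto the two new functions. The Python below is only a sketch of that mapping: urllib.parse.quote stands in for the escaping, the function names are Python spellings of the XPath names, and the character set left unescaped by iri_to_uri is an assumption, not a quotation from the specification.

```python
from urllib.parse import quote


def encode_for_uri(s: str) -> str:
    # Stand-in for fn:encode-for-uri: reserved characters are escaped too.
    return quote(s, safe="")


def iri_to_uri(s: str) -> str:
    # Stand-in for fn:iri-to-uri: URI punctuation and "%" are left alone
    # (assumed safe set).
    return quote(s, safe="!#$%&'()*+,/:;=?@[]")


def escape_uri(s: str, escape_reserved: bool) -> str:
    # The removed function, expressed in terms of its two replacements.
    return encode_for_uri(s) if escape_reserved else iri_to_uri(s)


with_reserved = escape_uri("a b/c", True)      # "a%20b%2Fc"
without_reserved = escape_uri("a b/c", False)  # "a%20b/c"
```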