2249 – R-257: Health warning needed about percent-escaping URIs

This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019.

Status:	CLOSED WORKSFORME

Alias:	None

Product:	XML Schema
Classification:	Unclassified
Component:	Datatypes: XSD Part 2
Version:	1.0/1.1 both
Hardware:	All All

Importance:	P2 normal
Target Milestone:	---
Assignee:	C. M. Sperberg-McQueen
QA Contact:	XML Schema comments list

URL:
Whiteboard:	medium, easy
Keywords:	resolved

Depends on:
Blocks:

Reported:	2005-09-14 19:46 UTC by Sandy Gao
Modified:	2009-04-21 19:21 UTC
CC List:	0 users

Description Sandy Gao 2005-09-14 19:46:24 UTC

PB -- RFC says application should make sure never to escape a string twice. 

HT -- XLink explicitly does _not_ escape # or %. I think that was our last 
group consensus. 

PB -- believe they should include a health warning. Will cause confusion. 

HT -- that may be the right resolution. Propose -- "The question of escaping % 
is vexed. The issue should be explained in an accompanying note so that readers 
will be confident that this is an intentional omission." 

XG -- sounds fine. 

... 

SG -- we don't have such a health warning in our spec. 

HT -- I think that's OK. We need to raise an issue against 3.2.17.1.

Comment 1 Sandy Gao 2005-10-28 19:05:35 UTC

Discussed at 2005-10-28 telecon [1].

Classified as clarification with corrigendum. Instruct the editors to add the 
proposed warning.

[1] http://lists.w3.org/Archives/Member/w3c-xml-schema-ig/2005Oct/0022.html

Comment 2 C. M. Sperberg-McQueen 2007-09-19 23:50:11 UTC

It's not clear that XSDL 1.1 actually needs a health warning about
escaping IRIs / anyURI values; it prescribes the escaping algorithm
found in section 3.1 of RFC 3987 (http://www.ietf.org/rfc/rfc3987.txt)
and that algorithm is said by that spec to be idempotent:

   The above mapping from IRIs to URIs produces URIs fully conforming to
   [RFC3986].  The mapping is also an identity transformation for URIs
   and is idempotent;  applying the mapping a second time will not
   change anything. 

Unless I'm missing something, therefore, the premise of this issue
(which I understand to be that we should warn users not to escape
their anyURI values more than once) is false for 1.1, and no change
is needed in 1.1 on this account.
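
The idempotence claim can be checked with a minimal sketch. It is a
simplification, not the full algorithm: RFC 3987 section 3.1 limits the
percent-encoding step to the 'ucschar' and 'iprivate' ranges and treats
the host component separately, whereas the hypothetical iri_to_uri
function below simply encodes every non-ASCII character. The example
IRI is invented for illustration.

```python
def iri_to_uri(iri: str) -> str:
    """Percent-encode each non-ASCII character as its UTF-8 octets.

    Assumption: every code point above U+007F is treated alike, which is
    coarser than the 'ucschar' / 'iprivate' ranges of RFC 3987.
    """
    out = []
    for ch in iri:
        if ord(ch) > 0x7F:
            out.extend(f"%{b:02X}" for b in ch.encode("utf-8"))
        else:
            out.append(ch)  # ASCII, including '%' and '#', passes through
    return "".join(out)


iri = "http://example.org/r\u00e9sum\u00e9#sec%201"
once = iri_to_uri(iri)
twice = iri_to_uri(once)

print(once)           # http://example.org/r%C3%A9sum%C3%A9#sec%201
print(twice == once)  # True
```

The output of the first pass is pure ASCII, and the mapping leaves ASCII
alone, so a second pass has nothing left to change.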

On the other hand, as far as I can tell, the escaping algorithm 
specified by XSDL 1.0, borrowed from section 2.4 of XLink
(http://www.w3.org/TR/2001/REC-xlink-20010627/#link-locators)
is also idempotent.  So I don't see what the health warning is
supposed to be warning the users about, unless it's about the 
possible danger of confusion if they try to use anyURI values
which already contain percent signs or hashes which have not 
already been escaped.  (And if it is, then the chance of alleviating
the confusion in a note may be somewhat smaller than the chance
of baffling the reader.)  Note that neither # nor % may appear
in an RFC 2396 URI unescaped: only 'unreserved' characters may
appear unescaped according to RFC 2396.  RFC 2732 does not
change this rule, as far as I can tell.
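
A rough sketch of why leaving % alone matters for idempotence. The
function xlink_escape approximates the XLink section 2.4 rule (escape
non-ASCII characters and the characters excluded by RFC 2396 section
2.4.3, except #, % and the square brackets); naive_escape is a
hypothetical counter-example that also escapes %. Both function names
and the sample string are illustrative assumptions, not text from either
spec.

```python
# ASCII characters excluded by RFC 2396 section 2.4.3, minus '#', '%',
# '[' and ']', which the XLink rule leaves unescaped.
DISALLOWED_ASCII = set(' <>"{}|\\^`')


def xlink_escape(value: str) -> str:
    """Approximation of the XLink 2.4 escaping rule."""
    out = []
    for ch in value:
        cp = ord(ch)
        # Control characters, DEL, non-ASCII, and the excluded ASCII set.
        if cp <= 0x1F or cp >= 0x7F or ch in DISALLOWED_ASCII:
            out.extend(f"%{b:02X}" for b in ch.encode("utf-8"))
        else:
            out.append(ch)
    return "".join(out)


def naive_escape(value: str) -> str:
    """Hypothetical variant that also escapes '%' (as %25)."""
    return xlink_escape(value.replace("%", "%25"))


s = "doc name.xml#frag%20x"

print(xlink_escape(s))                # doc%20name.xml#frag%20x
print(xlink_escape(xlink_escape(s)))  # doc%20name.xml#frag%20x
print(naive_escape(s))                # doc%20name.xml#frag%2520x
print(naive_escape(naive_escape(s)))  # doc%2520name.xml#frag%252520x
```

The naive variant changes its own output on every pass, which is the
double-escaping hazard the RFC warns about; the XLink-style rule avoids
it only because it never touches %.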

If the WG really wants a health warning for the 1.0 spec, perhaps
something like this will do the trick:

  Note:  the escaping mechanism prescribed here does not
  escape the percent sign (%) or the number sign (#); as a
  result, escaping a string a second time will not change
  the string.  Nevertheless, the observations in RFC 2396 
  about safety and risks of escaping mechanisms remain true,
  and implementors should bear them in mind.

Speaking for myself, however, I incline to think that we should
close this bug both for 1.0 and for 1.1 with a resolution of
WORKSFORME, signaling that now (as opposed to two years ago)
we do not see any need for change, health warning, or 
clarification.  (Alternatively, someone could explain to me, again,
why there is a problem after all, all appearances to the
contrary notwithstanding.)

Comment 3 Michael Kay 2007-09-20 23:30:07 UTC

We currently say:

The ·lexical space· of anyURI is finite-length character sequences.

Note: For an anyURI value to be usable in practice as an IRI, the result of applying to it the algorithm defined in Section 3.1 of [RFC 3987] should be a string which is a legal URI according to [RFC 3986].

A consequence of these two rules is that (a) an unescaped % sign (that is not part of an escape sequence) is allowed in an anyURI value, and (b) if such a % sign actually appears in an anyURI value, the value is not "usable in practice". 

If this is indeed what we want the rules to say, it seems to me that it is desirable to point out the consequences. I would add a note saying in effect that unescaped % signs are allowed but make the anyURI value useless.
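
As an illustration of consequence (b), a stray % can be detected with a short check. This is only a sketch: has_stray_percent is a hypothetical helper, and it tests a single condition (every % must begin a pct-encoded triplet as defined in RFC 3986), not full URI validity.

```python
import re

# RFC 3986 defines pct-encoded as "%" HEXDIG HEXDIG, so a '%' that is
# not followed by two hexadecimal digits cannot be part of an escape.
STRAY_PERCENT = re.compile(r"%(?![0-9A-Fa-f]{2})")


def has_stray_percent(value: str) -> bool:
    """Return True if value contains a '%' that does not start an escape."""
    return STRAY_PERCENT.search(value) is not None


print(has_stray_percent("http://example.org/100%25"))  # False
print(has_stray_percent("http://example.org/100%"))    # True
print(has_stray_percent("http://example.org/50%off"))  # True
```

A check like this cannot tell an intended escape from a literal % that happens to be followed by two hex digits, which is part of why the question is awkward to explain in a note.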

Comment 4 Michael Kay 2007-09-20 23:35:19 UTC

(Oh, and while we're about it, can't we get rid of this "finite-length" nonsense? There is no way a schema processor can be written to detect or reject infinite sequences, so restricting them to be finite has no practical effect.)

Comment 5 C. M. Sperberg-McQueen 2007-09-24 17:14:07 UTC

Comment #4 suggests deleting the words 'finite-length' from the
description of anyURI's lexical forms and from similar passages
elsewhere.

That seems to me at best a risky idea.

It's true that no implementation we know how to build will be able to
distinguish the set of finite-length sequences from the set of
infinite sequences.  But I think the stipulation that the sequence of
characters be finite has another utility: it helps ensure that the
question "is this string in the lexical space of this datatype?" can
be answered in finite time.

For arbitrary strings of finite length, and arbitrary grammars of
finite size, there are algorithms which can decide whether the string
is recognized by the grammar.  I do not believe the same is true for
infinite strings.  (At least, not if we accept the view that to be an
algorithm, a computational method must always terminate after a finite
number of steps.)  So I do not think it would be wise to define the
lexical spaces of any of our simple types as including infinite-length
strings.
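
A toy illustration of the termination point, under stated assumptions:
is_digits is a hypothetical recognizer for an invented lexical space
(non-empty runs of ASCII digits), not anything defined in the spec. It
shows that acceptance is only reported once the input ends, while
rejection can be reported after finitely many characters.

```python
from itertools import repeat
from typing import Iterable


def is_digits(chars: Iterable[str]) -> bool:
    """Accept non-empty sequences of ASCII digits."""
    seen_any = False
    for ch in chars:      # one step per input character
        if ch not in "0123456789":
            return False  # rejection needs only a finite prefix
        seen_any = True
    return seen_any       # reached only when the input ends


print(is_digits("2249"))  # True
print(is_digits("22x9"))  # False

# is_digits(repeat("7")) would never return: the input has no end,
# so the final 'return' is never reached.
```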

It's not solely a practical question; it's also a theoretical question.

(It's true that some students of automata theory have worked with
automata with infinite numbers of states, and that two-level [van
Wijngaarden] grammars correspond to context-free grammars with
infinite numbers of productions.  But in both of these cases, I think,
the existence of an algorithm for recognizing strings depends
crucially upon the strings' being finite in length.  In recognizing a
finite-length string, even an infinite state machine or grammar can
successfully be represented by a finite machine or grammar which
produces the same result for this particular string.  I am not a
computer scientist, and I have not taken the time this morning to
re-read my textbooks on this subject, so I could be wrong.  So I will
offer this weaker argument: almost all the textbooks I have seen
describe algorithms for recognizing finite-length, not infinite
strings.  In view of the finiteness of any actual machine available to
us, it seems unnecessary to require that implementors replace the
textbook algorithms with others capable of handling infinite inputs.)

Comment 6 C. M. Sperberg-McQueen 2007-09-24 17:21:48 UTC

The Working Group discussed this issue during its call of 21 September
(http://lists.w3.org/Archives/Member/w3c-xml-schema-ig/2007Sep/0004.html)
[member-only link].

On the arguments raised in comment #2 and comment #3, there was some 
sympathy on both sides, but we did not reach clarity or agreement on what,
precisely, a health warning on this issue should say.  On the point 
raised in comment #4 about the finiteness of strings, the WG was
persuaded by arguments analogous to those in comment #5, to the effect that
the finiteness of lexical representations is important for theoretical 
reasons, even if no finite implementation can ever recognize more than
a small subset of some lexical spaces.

The upshot was that the WG decided to close this issue as WORKSFORME,
signaling that on further reflection we don't have agreement that there
is actually a problem to solve here.  I'm changing the status of the
issue accordingly.

It was noted that neither Paul Biron (who appears from the description to
have raised the issue initially) nor Michael Kay (who has expressed
a view in comment #3) was present; either of them (or anyone else, for
that matter) can reopen the issue if they can provide further information
that persuades the WG that there is a problem after all.

Comment 7 C. M. Sperberg-McQueen 2007-09-24 17:29:13 UTC

p.s. Discussion during the telecon made clear that it should also be pointed
out that what is at issue here are not arbitrarily long strings (i.e. 
strings whose length is some large integer) but infinitely long strings
(i.e. strings whose length is not equal to any integer at all, but is
infinite).  Arbitrarily long strings ARE included in the descriptions of 
our lexical spaces.

A discussion of the relevant issues which some WG members have found
interesting in the past is given in D. Terence Langendoen and Paul M. Postal,
The Vastness of Natural Languages (London: Basil Blackwell, 1984).  
Langendoen and Postal are concerned with natural, not artificial, languages,
but their discussion of the mathematical and formal background is 
relevant also for our concerns.

Comment 8 Michael Kay 2007-09-24 18:18:56 UTC

In response to comment #5: "it helps ensure that the question 'is this string in the lexical space of this datatype?' can be answered in finite time."

I don't think this is true. If I imagine an infinite document supplied as input to a streaming validator, then I cannot in finite time decide whether the document is valid or not, and this is true whether or not I constrain the strings within the document to be finite. Saying they must be finite therefore adds nothing.

Looking at it another way, in your paper at Extreme Markup 2005

http://www.idealliance.org/papers/extreme/proceedings/html/2005/SperbergMcQueen02/EML2005SperbergMcQueen02.html

you wrote "I pride myself on my spec draftsmanship, but [...] is not a definition I would want to make; it's not something that would turn into what the QA people would consider a testable assertion."

Quite right: it's a good idea not to say something in a spec unless it's a testable assertion. And in my view, saying that a string must be finite falls firmly into the non-testable category.