This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019. Please see the home page for more details.
PB -- RFC says application should make sure never to escape a string twice.
HT -- XLink explicitly does _not_ escape # or %. I think that was our last group consensus.
PB -- believe they should include a health warning. Will cause confusion.
HT -- that may be the right resolution. Propose -- "The question of escaping % is vexed. The issue should be explained in an accompanying note so that readers will be confident that this is an intentional omission."
XG -- sounds fine.
...
SG -- we don't have such a health warning in our spec.
HT -- I think that's OK. We need to raise an issue against 3.2.17.1.
Discussed at 2005-10-28 telecon [1]. Classified as clarification with corrigendum. Instruct the editors to add the proposed warning. [1] http://lists.w3.org/Archives/Member/w3c-xml-schema-ig/2005Oct/0022.html
It's not clear that XSDL 1.1 actually needs a health warning about escaping IRIs / anyURI values; it prescribes the escaping algorithm found in section 3.1 of RFC 3987 (http://www.ietf.org/rfc/rfc3987.txt) and that algorithm is said by that spec to be idempotent:
    The above mapping from IRIs to URIs produces URIs fully conforming to [RFC3986]. The mapping is also an identity transformation for URIs and is idempotent; applying the mapping a second time will not change anything.
Unless I'm missing something, therefore, the premise of this issue (which I understand to be that we should warn users not to escape their anyURI values more than once) is false for 1.1, and no change is needed in 1.1 on this account.
On the other hand, as far as I can tell, the escaping algorithm specified by XSDL 1.0, borrowed from section 2.4 of XLink (http://www.w3.org/TR/2001/REC-xlink-20010627/#link-locators), is also idempotent. So I don't see what the health warning is supposed to be warning the users about, unless it's about the possible danger of confusion if they try to use anyURI values which already contain percent signs or hashes which have not already been escaped. (And if it is, then the chance of alleviating the confusion in a note may be somewhat smaller than the chance of baffling the reader.)
Note that neither # nor % may appear in an RFC 2396 URI unescaped: only 'nonreserved' characters may appear unescaped according to RFC 2396. RFC 2732 does not change this rule, as far as I can tell.
If the WG really wants a health warning for the 1.0 spec, perhaps something like this will do the trick:
    Note: the escaping mechanism prescribed here does not escape the percent sign (%) or the number sign (#); as a result, escaping a string a second time will not change the string. Nevertheless, the observations in RFC 2396 about safety and risks of escaping mechanisms remain true, and implementors should bear them in mind.
Speaking for myself, however, I incline to think that we should close this bug both for 1.0 and for 1.1 with a resolution of WORKSFORME, signaling that now (as opposed to two years ago) we do not see any need for change, health warning, or clarification. (Alternatively, someone could explain to me, again, why there is a problem after all, all appearances to the contrary notwithstanding.)
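As a rough illustration of the idempotence point made in the comment above, the following Python sketch escapes non-ASCII characters and a handful of excluded ASCII characters while leaving % and # untouched. The function name escape_once and the exact character set in DISALLOWED_ASCII are assumptions made for this illustration; they approximate the XLink 2.4 rule and are not taken verbatim from XLink or RFC 3987.

```python
# Hypothetical, simplified escaping step. The set below is an assumption that
# approximates the characters XLink 2.4 escapes; '%' and '#' are deliberately
# NOT in it, and square brackets are left alone as well.
DISALLOWED_ASCII = set('<>"{}|\\^` ')


def escape_once(value: str) -> str:
    """Percent-escape disallowed characters as UTF-8 bytes; leave '%' and '#' alone."""
    out = []
    for ch in value:
        code = ord(ch)
        # Control characters, DEL, non-ASCII, and the listed ASCII characters
        # are replaced by %HH for each byte of their UTF-8 encoding.
        if code < 0x20 or code > 0x7E or ch in DISALLOWED_ASCII:
            out.extend(f"%{b:02X}" for b in ch.encode("utf-8"))
        else:
            out.append(ch)
    return "".join(out)


if __name__ == "__main__":
    original = "http://example.org/a b/é#frag"
    once = escape_once(original)
    twice = escape_once(once)
    print(once)
    print(once == twice)  # the second pass finds nothing left to escape
```

Because every character produced by the first pass (%, hex digits) is one the function passes through unchanged, a second pass cannot alter the string. The sketch also shows the other side of the point: a value such as "100%" is returned as is, so the algorithm neither repairs nor flags a percent sign that is not part of an escape sequence.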
We currently say:
    The ·lexical space· of anyURI is finite-length character sequences.
    Note: For an anyURI value to be usable in practice as an IRI, the result of applying to it the algorithm defined in Section 3.1 of [RFC 3987] should be a string which is a legal URI according to [RFC 3986].
A consequence of these two rules is that (a) an unescaped % sign (that is not part of an escape sequence) is allowed in an anyURI value, and (b) if such a % sign actually appears in an anyURI value, the value is not "usable in practice". If this is indeed what we want the rules to say, it seems to me that it is desirable to point out the consequences. I would add a note saying in effect that unescaped % signs are allowed but make the anyURI value useless.
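The situation described in the comment above can be made concrete with a small Python sketch that looks for a % sign not followed by two hexadecimal digits. The name has_stray_percent is hypothetical, and the check covers only this one condition; it is not a validator for RFC 3986 syntax as a whole.

```python
import re

# A '%' that is not followed by exactly two hex digits is not part of a
# percent-escape sequence. (Illustrative check only; hypothetical helper.)
_STRAY_PERCENT = re.compile(r"%(?![0-9A-Fa-f]{2})")


def has_stray_percent(value: str) -> bool:
    """Return True if the value contains a '%' that does not begin an escape sequence."""
    return _STRAY_PERCENT.search(value) is not None


if __name__ == "__main__":
    for candidate in ("http://example.org/a%20b", "http://example.org/100%"):
        print(candidate, has_stray_percent(candidate))
```

Under the rules quoted above, both candidate strings are in the lexical space of anyURI, but only the first could pass the "usable in practice" test, since the second still contains a bare % after escaping.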
(Oh, and while we're about it, can't we get rid of this "finite-length" nonsense? There is no way a schema processor can be written to detect or reject infinite sequences, so restricting them to be finite has no practical effect.)
Comment #4 suggests deleting the words 'finite-length' from the description of anyURI's lexical forms and from similar passages elsewhere. That seems to me at best a risky idea.
It's true that no implementation we know how to build will be able to distinguish the set of finite-length sequences from the set of infinite sequences. But I think the stipulation that the sequence of characters be finite has another utility: it helps ensure that the question "is this string in the lexical space of this datatype?" can be answered in finite time. For arbitrary strings of finite length, and arbitrary grammars of finite size, there are algorithms which can decide whether the string is recognized by the grammar. I do not believe the same is true for infinite strings. (At least, not if we accept the view that to be an algorithm, a computational method must always terminate after a finite number of steps.)
So I do not think it would be wise to define the lexical spaces of any of our simple types as including infinite-length strings. It's not solely a practical question; it's also a theoretical question.
(It's true that some students of automata theory have worked with automata with infinite numbers of states, and that two-level [van Wijngaarden] grammars correspond to context-free grammars with infinite numbers of productions. But in both of these cases, I think, the existence of an algorithm for recognizing strings depends crucially upon the strings' being finite in length. In recognizing a finite-length string, even an infinite state machine or grammar can successfully be represented by a finite machine or grammar which produces the same result for this particular string.
I am not a computer scientist, and I have not taken the time this morning to re-read my textbooks on this subject, so I could be wrong. So I will offer this weaker argument: almost all the textbooks I have seen describe algorithms for recognizing finite-length, not infinite, strings.
In view of the finiteness of any actual machine available to us, it seems unnecessary to require that implementors replace the textbook algorithms with others capable of handling infinite inputs.)
The Working Group discussed this issue during its call of 21 September (http://lists.w3.org/Archives/Member/w3c-xml-schema-ig/2007Sep/0004.html) [member-only link]. On the arguments raised in comment #2 and comment #3, there was some sympathy on both sides, but we did not reach clarity or agreement on what, precisely, a health warning on this issue should say. On the point raised in comment #4 about the finiteness of strings, the WG was persuaded by arguments analogous to those in comment #5, to the effect that the finiteness of lexical representations is important for theoretical reasons, even if no finite implementation can ever recognize more than a small subset of some lexical spaces. The upshot was that the WG decided to close this issue as WORKSFORME, signaling that on further reflection we don't have agreement that there is actually a problem to solve here. I'm changing the status of the issue accordingly. It was noted that neither Paul Biron (who appears from the description to have raised the issue initially) nor Michael Kay (who has expressed a view in comment #3) was present; either of them (or anyone else, for that matter) can reopen the issue if they can provide further information that persuades the WG that there is a problem after all.
p.s. Discussion during the telecon made clear that it should also be pointed out that what is at issue here are not arbitrarily long strings (i.e. strings whose length is some large integer) but infinitely long strings (i.e. strings whose length is not equal to any integer at all, but is infinite). Arbitrarily long strings ARE included in the descriptions of our lexical spaces. A discussion of the relevant issues which some WG members have found interesting in the past is given in D. Terence Langendoen and Paul M. Postal, The Vastness of Natural Languages (London: Basil Blackwell, 1984). Langendoen and Postal are concerned with natural, not artificial, languages, but their discussion of the mathematical and formal background is relevant also for our concerns.
In response to comment #5: "it helps ensure that the question 'is this string in the lexical space of this datatype?' can be answered in finite time." I don't think this is true. If I imagine an infinite document supplied as input to a streaming validator, then I cannot in finite time decide whether the document is valid or not, and this is true whether or not I constrain the strings within the document to be finite. Saying they must be finite therefore adds nothing.
Looking at it another way, in your paper at Extreme Markup 2005 (http://www.idealliance.org/papers/extreme/proceedings/html/2005/SperbergMcQueen02/EML2005SperbergMcQueen02.html) you wrote: "I pride myself on my spec draftsmanship, but [...] is not a definition I would want to make; it's not something that would turn into what the QA people would consider a testable assertion." Quite right: it's a good idea not to say something in a spec unless it's a testable assertion. And in my view, saying that a string must be finite falls firmly into the non-testable category.