Mappings and identity in URIs and IRIs

Preface: This document was originally written in 2003, before the IRI spec was an RFC. Some of this has since been addressed in the RFC.

Summary: There is a discrepancy between the namespaces and URI specs about which identifiers are equivalent. The only reason this has not caused a problem is that the test case (two equivalent but not equal Unicode character sequences being used) has not occurred in practice. Malicious use of IRIs could, however, deliberately introduce a bug which could cause a security problem.

This note uses relationship notation (why not use N3?) to discuss the inconsistencies between some current thinking about IRIs, URIs, and, for example, namespace names.

Requirements:

1. URI identity is shared by all parties. Within a given context (*), there is a single (inverse functional) relationship uri(x, a) between an ASCII string a, taken as a URI, and the thing x which it identifies.

2. For the users of any specification which mentions URIs: when one can prove, by reading [scheme-independent] specs, that two URIs are equivalent, then one can use one in place of the other. That is, when URI (or IRI) strings are deemed "equivalent" then they must refer to the same object.

3. We should be able to use the same software to parse and compare URIs wherever they are used, e.g. in namespace names or in hypertext links.

What do we get from the specs?

Let us formalize the concepts in the documents we are talking about.

The URI spec

uri(x,a) => A(a)

where A(a) means that it is a sequence of ASCII characters (grounded in ANSI X3.4-1986).

The ANSI spec gives a 1:1 mapping ascii(a, s) from the set A of ASCII characters to the set S of septets (integers between 0 and 127 inclusive).

Let sames(s1, s2) be the "strcmp" relation between two strings which are septet for septet identical.

Consider the equivalence relation ea(a1, a2) which we use here to indicate that two uris identify the same thing. It (is symmetric and transitive and) has properties

ea(a1, a2) & uri(t1, a1) => uri(t1, a2)

for all a1, a2 (for some t uri(t, a1) & uri(t, a2)) <=> ea(a1, a2)

A(a) <=> ea(a,a)
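As a toy illustration of these properties, the sketch below models uri(x, a) as a lookup table from ASCII strings to the things they identify, and derives ea from it. The table contents and the Python names URI_TABLE, uri and ea are hypothetical, chosen only for this example; nothing here is taken from the URI spec itself.

```python
# Hypothetical toy model: each ASCII string identifies at most one thing,
# which is what makes the relation inverse functional.
URI_TABLE = {
    "http://example.org/a": "thing-1",
    "http://example.org/%61": "thing-1",  # assumed alias of the first entry
    "http://example.org/b": "thing-2",
}

def uri(x, a):
    """uri(x, a): the ASCII string a identifies the thing x."""
    return a in URI_TABLE and URI_TABLE[a] == x

def ea(a1, a2):
    """ea(a1, a2): some thing t has both uri(t, a1) and uri(t, a2)."""
    return a1 in URI_TABLE and a2 in URI_TABLE and URI_TABLE[a1] == URI_TABLE[a2]

strings = list(URI_TABLE)

# Reflexive on known strings, symmetric, and transitive.
assert all(ea(a, a) for a in strings)
assert all(ea(a, b) == ea(b, a) for a in strings for b in strings)
assert all(ea(a, c)
           for a in strings for b in strings for c in strings
           if ea(a, b) and ea(b, c))

# Substitution: ea(a1, a2) & uri(t, a1) => uri(t, a2)
assert all(uri(URI_TABLE[a], b) for a in strings for b in strings if ea(a, b))
```

Because the table maps each string to exactly one thing, ea falls out as "same table value", which is why it is automatically an equivalence relation in this model.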

Now in fact we are going to deal with the ASCII-encoded septets, for which a similar equivalence holds

es(s1, s2) <=> Exists a1, a2 such that ascii(a1, s1) & ascii(a2, s2) & ea(a1, a2)

The URI spec mentions two uses of hexadecimal encoding. Hex encoding relates octet strings to septet strings. When the URI spec was written, the significance of the octets greater than 127 was not defined.

It implies that if you see %HH in a URI you should consider it as an encoding of an octet. There is (at the level of this spec) the notion that the URI is an encoding of a string of octets. Those from 0-127 are considered as representing ASCII characters. There is no assumption about what the others represent. The IRI spec will later take advantage of this.

hexify(s1, s2) is true if the only difference, if any, between s1 and s2 is that one or more characters in s1 are replaced in s2 by their %HH or %hh encoding, and ascii(s2).

ascii(s) => hexify(s, s)

hexify(s, s)

There are another 128 characters in this notional "extended" set, each of which has a hex encoding.

(DanC: hexify(s+ c, s+hexify(c))

hexify('A') = '%41'

corollary: hexify(s1, s2) => ascii(s2))

I take hexify to be a subrelation of equality. That is, the URI spec authorizes one to use s2 where you would have used s1. In some cases, such as 7-bit transports like HTTP, you have to. It is important that hexification preserves the identity of the resource.

hexify(t, s1) & hexify(t, s2) => es(s1, s2)

{ for some s, hexify(t1, s) & hexify(t2, s) } <=> et(t1, t2)

Note that equivalence is preserved by the interchange of "%20" with " ", but not by interchange of "%2F" with "/".
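The hexify-based equivalence, including the reserved-character caveat just noted, can be sketched as follows. This is only an illustration under stated assumptions: the reserved set is the one listed in RFC 2396 section 2.2, and the function names unhexify and es are made up for this example. Two strings are treated as equivalent when they reduce to the same form after undoing every escape that is safe to undo.

```python
import re

# Assumption: the "reserved" characters of RFC 2396 section 2.2.
RESERVED = set(";/?:@&=+$,")

_ESCAPE = re.compile(r"%([0-9A-Fa-f]{2})")

def unhexify(s):
    """Undo %HH escapes where that preserves meaning.

    Escapes of reserved characters, of '%' itself, and of octets above 127
    are kept, but with upper-case hex digits so that %hh and %HH agree.
    """
    def decode(match):
        octet = int(match.group(1), 16)
        if octet > 127 or chr(octet) in RESERVED or chr(octet) == "%":
            return "%" + match.group(1).upper()
        return chr(octet)
    return _ESCAPE.sub(decode, s)

def es(s1, s2):
    """True if s1 and s2 are hexifications of one common string."""
    return unhexify(s1) == unhexify(s2)

assert es("a%20b", "a b")            # "%20" and " " interchange
assert not es("a%2Fb", "a/b")        # "%2F" and "/" do not
assert es("ros%c3%a9", "ros%C3%a9")  # case of hex digits is irrelevant
```

Each escape is examined exactly once, so a string such as "%2541" is left alone rather than being decoded twice.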

rel(s, b, r) is a many-many relation between ASCII strings: r is a relative URI reference for s relative to b. Implication of spec is

The Unicode spec

UTF-8 [Unicode 3.2] gives us a relation utf8(i, s)

Note by the way that

ascii(s) => utf8(s,s)

utf8(i, s) is true if i is a string of Unicode characters, and s is an extended ASCII string of octets, and the relationship is as specified in the UTF-8 specification.
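In Python terms, the utf8 relation can be checked with the built-in codec. The function name utf8 simply mirrors the relation above, and the sample strings are made up for illustration.

```python
def utf8(i, s):
    """utf8(i, s): the octet string s is the UTF-8 encoding of the Unicode string i."""
    return i.encode("utf-8") == s

# ascii(s) => utf8(s, s): pure ASCII text encodes to the very same octets.
assert utf8("http://www.example.org/", b"http://www.example.org/")

# A non-ASCII character becomes a multi-octet sequence, every octet above 127.
assert utf8("ros\u00e9", b"ros\xc3\xa9")
```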

sameu(i1, i2)

is true whenever the two unicode strings convey exactly the same series of glyphs and/or control characters. There are strings which are not identical

The IRI spec

This says that (basically, with some work on corner cases etc.) there should be a convention that any 8-bit string which is not ASCII, and which can be interpreted as a UTF-8 encoding, should be interpreted as a UTF-8 encoding.

There is a canonicalization function which the IRI spec uses, defined in @@, which allows a particular

ucan(i,i)

Axioms are that it is a function:

ucan(i, j1) . ucan(i, j2) => strcmp(j1, j2)

for all i: can(i,i)

e(s1, s2).
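A minimal sketch of sameu and ucan follows. It assumes that the canonical form is Unicode Normalization Form C (NFC) and that "same glyphs" can be approximated by canonical equivalence. Both are assumptions made for illustration, since the reference for the canonicalization function is left open above.

```python
import unicodedata

def ucan(i):
    """Canonical form of the Unicode string i (NFC is assumed here)."""
    return unicodedata.normalize("NFC", i)

def sameu(i1, i2):
    """Approximation of sameu: equal after canonicalization."""
    return ucan(i1) == ucan(i2)

composed = "ros\u00e9"      # e-acute as a single code point
decomposed = "rose\u0301"   # "e" followed by a combining acute accent

assert composed != decomposed        # not identical as code point sequences
assert sameu(composed, decomposed)   # yet they convey the same glyphs
assert ucan(ucan(decomposed)) == ucan(decomposed)  # a canonical form is its own canonical form
```

Because ucan is an ordinary deterministic function here, the axiom that two results for the same input compare equal under strcmp holds trivially.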

There is a function (not 1:1) which we define as

iri_uri(i, s) <=> for some j, t: ucan(i, j). utf8(j, t). hexify(t, s)

IRIs are defined as the domain of that function, where the range is URIs. An IRI is any Unicode string which, when canonicalized, UTF-8 encoded, and hexified, is a URI.

There is a URI equivalent to every IRI. There is NOT an IRI for every 8-bit string t. There is at least one IRI for every URI: itself.
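The composition iri_uri can be sketched as three steps in sequence. This is an illustration only: NFC stands in for the canonicalization function (an assumption), the name iri_to_uri is hypothetical, and no check is made that the result is actually a legal URI.

```python
import unicodedata

def iri_to_uri(i):
    """Map a Unicode string to ASCII by canonicalizing, UTF-8 encoding, then hexifying."""
    j = unicodedata.normalize("NFC", i)  # ucan(i, j), with NFC assumed
    t = j.encode("utf-8")                # utf8(j, t)
    # hexify(t, s): escape only the octets above 127, leave ASCII untouched
    return "".join(chr(b) if b < 128 else "%{:02X}".format(b) for b in t)

# Not 1:1: two different Unicode strings map to the same URI.
assert iri_to_uri("http://www.example.org/ros\u00e9") == "http://www.example.org/ros%C3%A9"
assert iri_to_uri("http://www.example.org/rose\u0301") == "http://www.example.org/ros%C3%A9"

# A string that is already a URI maps to itself.
assert iri_to_uri("http://www.example.org/ros%C3%A9") == "http://www.example.org/ros%C3%A9"
```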

For requirement 2, equivalent IRIs must identify the same thing:

iri_uri(i1, s1). iri_uri(i2, s2). sameu(i1, i2) => e(s1, s2)

The namespace spec

The namespaces specification 2.3 talks about identifiers being different. Specifically, "http://www.example.org/ros%c3%a9" and "http://www.example.org/ros%C3%a9" are different. Let's call these constant strings D1 and D2 for short.

ne(D1, D2)

Now "difference" is something which allows them, for example, to occur as different attributes in an XML element. It seems to me that this ne is the negation of e. It is the common understanding of differentness such that two things can't be both different and the same. To make it otherwise would be very confusing and would prevent (3).

ne(s1, s2) => ~e(s1, s2)

Ouch. We have one spec saying that these are different, and another saying that they are the same.
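The clash can be reproduced directly with the two constants. In this small sketch, a plain string comparison stands in for the namespace spec's notion of "different", and decoding the escapes to octets with the standard library's urllib.parse.unquote_to_bytes stands in for the URI spec's notion of "the same". Both stand-ins are simplifications chosen for illustration.

```python
from urllib.parse import unquote_to_bytes

D1 = "http://www.example.org/ros%c3%a9"
D2 = "http://www.example.org/ros%C3%a9"

# Namespace-style comparison: character for character, the strings differ.
ne = D1 != D2

# URI-style comparison: both strings encode exactly the same octets.
same_octets = unquote_to_bytes(D1) == unquote_to_bytes(D2)

assert ne and same_octets
```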

That isn't logically compatible. The whole layering of the different forms of equality described in Tim Bray's draft finding is of the form

e_uri(s,t) => e(s,t)

e_http(s,t) => e_uri(s,t)

So if you accept the requirements above, and you accept any of the equivalences, we have to throw out that part of XML namespaces.

Choices

1. Ignore the equivalences, like the namespace spec. This causes a bug if anyone uses two identifiers which are different strings but equivalent. The only practical way of doing that is to make any non-canonical IRIs or URIs illegal. This means IRIs cannot be used except in their trivial URI form.

2. Transmit in any form, receiver makes right. The receiver must compare equivalence-aware or must canonicalize before internal use (which has the same effect).

3. Make IRIs be just Unicode strings. Scratch the axiom that hexifying leaves a valid and equivalent IRI. Allow the hexified forms to be used to identify quite different things, in IRIs. Allow IRIs to be converted into URIs, but do NOT allow any place where URIs and IRIs can be used interchangeably. This works toward a DanC-proposed world of Unicode character string comparison. It does not allow a smooth transition for existing browsers etc. which mix URIs and IRIs.
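Choice 2 can be sketched as a receiver that canonicalizes at its boundary and uses plain string comparison internally. The canonicalize function below handles only one equivalence, the case of hex digits in %HH escapes, and the NamespaceTable class is a hypothetical structure invented for this example. A real receiver would have to cover every equivalence it accepts.

```python
import re

_ESCAPE = re.compile(r"%[0-9A-Fa-f]{2}")

def canonicalize(identifier):
    """Upper-case the hex digits of every %HH escape (the only rule applied here)."""
    return _ESCAPE.sub(lambda m: m.group(0).upper(), identifier)

class NamespaceTable:
    """Hypothetical receiver-side table keyed by canonical namespace name."""

    def __init__(self):
        self._bindings = {}

    def bind(self, name, value):
        # Canonicalize once on the way in ...
        self._bindings[canonicalize(name)] = value

    def lookup(self, name):
        # ... so that internal matching is an exact string comparison.
        return self._bindings.get(canonicalize(name))

table = NamespaceTable()
table.bind("http://www.example.org/ros%c3%a9", "rose vocabulary")

# The sender used a different but equivalent spelling; the receiver makes right.
assert table.lookup("http://www.example.org/ros%C3%a9") == "rose vocabulary"
```

Canonicalizing on entry rather than at each comparison keeps the equivalence rules in one place, which is the sense in which the two options in choice 2 have the same effect.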

Reality factors

There are NOT very many actual uses of D1 and D2, because there aren't really any motivations for making them.

There ARE motivations for using (non-URI) IRIs. People are in fact using them, though maybe not for namespaces yet.

There are NOT many if any uses of different IRIs or different URIs for the same namespace.

Conclusion

We should continue the recommendation not to use URIs or IRIs which are equivalent but arbitrarily different strings. The easiest way of ensuring this is to use a canonical form. We can therefore deprecate the transmission or use of non-canonical forms.

We should switch as soon as possible to canonicalizing IRIs in all applications before comparison (or using equivalence-aware comparisons). The Namespaces spec should change to say when things are the same. The constraint in XML that attributes cannot occur twice should be made more complicated. It should say that you can't have two occurrences which are the same attribute name, or two attributes which are equivalent in any way, leaving, I regret, some fuzziness. For example, you can't use the xhtml1.0 and xml1.1 namespaces in the same document to put two src attributes on an image! They are not even the same namespace, but clearly they are equivalent at the application level. It should be clear that the fact that strings are different is not a guarantee that the namespaces are different. The parser just isn't expected to spot this. But I think the parser ought to be allowed to consistently canonicalize. That makes life much easier for the application. DanC wanted to be able to do strcmp, and he can if the parser canonicalizes.

We should then in a few years be able to relax the constraint on not transmitting multiple different forms.

We should formalize with names the various functions above, and make sure there are good working coded implementations of them in the major languages. A standard API will help. URI working group stuff.

Footnotes:

The foundational architecture of the web is that there is a global context common to all publicly published documents, in which each URI is agreed by everyone to identify the same thing. In practice of course, things break and people are confused and misled. Those making formal systems often restrict the scope of data to that in which this ideal approximation can be taken to hold in practice as well as in theory.

The fact that the use of URIs varies with time (sad but true) (we are NOT talking about living documents or concepts whose representations change, here, but really reuse of the same URI for a totally different concept) means that to model things over a relatively long time one might want to model the time-varying nature:

This time modelling can be done and has been done in many ways, but is not addressed here.