XML/URI Questions and Answers

DRAFT $Revision: 1.2 $
of $Date: 2000/06/20 07:00:03 $ by $Author: connolly $

Q: The namespaces spec used to say that an XML namespace was a system identifier [@@cite/check], i.e. a URI. Then, at the last minute[@@], it changed to say that an XML namespace was a URI reference. Why? What's the difference?

A: Leaving why aside for starters... let's first contrast the terms URI reference and absolute URI. An absolute URI is the thing you (typically) hand to your network software to say "please fetch some content using this identifier". The following are absolute URIs:

http://www.w3.org/
http://example.org/seg1/seg2/last-path-seg
http://www.ietf.org/rfc/rfc0822.txt
mailto:connolly@w3.org
mid:2l34kj2lj2lkj34lk2j32@w3.org
urn:example-ns:more-stuff-here
file://myhost.example.com/dir1/dir2/stuff
file:/dir1/dir2/stuff
scheme-you-never-heard-of:more-stuff-here

In contrast, the following are not absolute URIs:

/seg1/seg2/last-path-seg
hello.dtd
/seg1/seg2/last-path-seg#frogs
http://www.w3.org/XML/#9802xml

The first three are obviously not absolute. The last one begins with an absolute URI, but the syntax of absolute URIs doesn't include fragment identifiers, so taken as a whole it isn't one. The syntax of URI references includes all of the above, absolute or not. That's at least part of the answer to why the namespaces spec changed from URI to URI reference: the designers intended for fragment identifiers to be allowed in namespace names.

Unfortunately, the URI specification does not have a term for 'an absolute URI with an optional fragment identifier'. For the purpose of this discussion, let's call that a URI+.
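To make the three categories concrete, here is a minimal sketch in Python. The function name classify and its return labels are made up for this discussion; the only check it performs is "does the string start with a scheme?" and "does it contain a #?", so it is not a full syntax validator for the URI spec.

```python
import re

# A scheme is a letter followed by letters, digits, "+", "." or "-",
# terminated by ":" (this pattern is the only syntax check done here).
SCHEME = re.compile(r"^[A-Za-z][A-Za-z0-9+.-]*:")


def classify(s):
    """Return a rough label for s (hypothetical helper, not from any spec)."""
    has_scheme = SCHEME.match(s) is not None
    has_fragment = "#" in s
    if not has_scheme:
        return "relative URI reference"
    if has_fragment:
        return "URI+ with fragment"
    # An absolute URI is also a URI+ (the fragment is optional).
    return "absolute URI"


for example in [
    "http://www.w3.org/",
    "scheme-you-never-heard-of:more-stuff-here",
    "hello.dtd",
    "/seg1/seg2/last-path-seg#frogs",
    "http://www.w3.org/XML/#9802xml",
]:
    print(example, "->", classify(example))
```

Every string in both lists above is a URI reference; the sketch only separates out which of them are also absolute URIs or URI+s.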

Note that the semantics of URI references is very different from the semantics of absolute URIs. The typical algorithm for following a hypertext reference in the Web is: given an absolute URI--let's call it b--and a URI reference--call it r:

abs = combine(b, stripFragment(r))
bytesOfContent, mime-type, ...other metadata... = getRemoteContent(abs, language/content preferences)
displayFactory(mime-type).display(bytesOfContent).viewFragment(getFragment(r))

[@@I say typically, because, for example, most user agents don't have a facility for doing getRemoteContent on mailboxes (mailto: URIs). They do a prepareToPostTo(absURI) instead.]
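The three steps of that algorithm can be sketched in Python as follows. This is only an illustration under stated assumptions: urljoin and urldefrag are real functions from the standard urllib.parse module, but follow, get_remote_content and display are hypothetical names, and the two callables are supplied by the caller so that the sketch does no networking or rendering of its own. Content negotiation preferences are left out.

```python
from urllib.parse import urldefrag, urljoin


def follow(b, r, get_remote_content, display):
    """Follow URI reference r relative to base absolute URI b.

    get_remote_content(abs_uri) -> (bytes_of_content, mime_type)
    display(mime_type, bytes_of_content, fragment) -> whatever the viewer returns
    Both callables are assumptions supplied by the caller.
    """
    # stripFragment(r) and getFragment(r) in one step
    ref_without_fragment, fragment = urldefrag(r)
    # combine(b, ...): resolve the fragment-less reference against the base
    abs_uri = urljoin(b, ref_without_fragment)
    # Only the absolute URI goes to the network layer; the fragment never does.
    bytes_of_content, mime_type = get_remote_content(abs_uri)
    # The fragment is interpreted on the client side, per media type.
    return display(mime_type, bytes_of_content, fragment)
```

Note where the fragment travels: it is split off before the fetch and only rejoins the process at display time, which is why the absolute URI and the URI+ play different roles.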

So the semantics of an absolute URI is, as the name suggests, to identify a resource; a URI+ then identifies a resource or some fragment of/view on a resource. The semantics of a URI reference, on the other hand, is that it denotes a URI+ when combined with a base absolute URI. To reiterate: a URI+ identifies a resource, but a URI reference refers to a URI+.
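As a small illustration of "a URI reference refers to a URI+", Python's standard urllib.parse.urljoin combines a base absolute URI with a URI reference and keeps the fragment. The helper name to_uri_plus is made up for this discussion.

```python
from urllib.parse import urljoin


def to_uri_plus(base, reference):
    """Resolve a URI reference against a base absolute URI, keeping any fragment."""
    return urljoin(base, reference)


base = "http://example.org/seg1/seg2/last-path-seg"
for reference in ["hello.dtd", "/seg1/seg2/last-path-seg#frogs", "#frogs"]:
    print(reference, "->", to_uri_plus(base, reference))
```

The same reference combined with a different base denotes a different URI+, which is the sense in which the reference by itself does not identify anything.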

Q: You explained absolute URI, URI reference, and URI+, but you never told me what a URI is.

A: The URI spec, unfortunately, doesn't define the term URI. It characterizes URIs, gives examples, etc., but it neither specifies a syntax for URI nor defines the term as such. A conservative (conventional?) reading of the spec infers that the term URI means absolute URI.

Q: But... the XML 1.0 spec uses the term URI. What does it mean by URI?

A: It seems to mean URI reference, since it gives examples in relative form and explains how to expand system identifiers to absolute form. But it seems to treat fragment identifiers as an error. I think the XML Core WG is discussing this; stay tuned to the errata@@.

Q: The XML 1.0 spec also says how to treat non-ASCII characters in system identifiers; I thought URIs only contained ASCII characters. What's up?

A: XML 1.0 follows HTML 4.0 in using a slightly sloppy specification for how to "handle a non-ASCII character in a URI". In fact, there is no such thing as a non-ASCII character in a URI (absolute URI or URI reference). But there is a convention for interpreting a string of Unicode characters as a URI reference by UTF-8 encoding and %HH escaping the non-ASCII characters. For this discussion, we'll follow the Internet Draft Internationalized Uniform Resource Identifiers (IURI) (see also: background on i18N of URIs) and use the term IURI reference for a Unicode string that is intended to represent a URI reference in this fashion.
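The convention described above (UTF-8 encode each non-ASCII character, then %HH escape each resulting byte) can be sketched in Python like this. The function name iuri_to_uri_reference is hypothetical, and the sketch assumes ASCII characters, including any existing % escapes, are passed through untouched.

```python
def iuri_to_uri_reference(iuri):
    """Turn a Unicode string (an IURI reference) into an ASCII-only URI reference."""
    out = []
    for ch in iuri:
        if ord(ch) < 128:
            # ASCII characters are left exactly as they are.
            out.append(ch)
        else:
            # Non-ASCII: UTF-8 encode, then escape each byte as %HH.
            out.extend("%{:02X}".format(byte) for byte in ch.encode("utf-8"))
    return "".join(out)


print(iuri_to_uri_reference("r\u00e9sum\u00e9.html#frogs"))
# prints r%C3%A9sum%C3%A9.html#frogs
```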

Q: Er... so... if a URI reference refers to URI+, and a namespace name is a URI reference, then the thing that a namespace name refers to is a URI+, right? i.e. a namespace is a URI+???

A: That's one logical conclusion of a very literal reading of the namespaces spec as written.

@@@@@

Q: but file:/dir1/dir2/stuff denotes different resources depending on which machine you're using, no?

A: one view is "no, it always refers to /dir1/dir2/stuff on the machine you're using, just like http://my.yahoo.com/ always refers to the client's personalized version of yahoo." @@

Q: Is a base URI intrinsic to an XML document? In other words: if I have a simple <helloWorld/> document at http://example.org/dir1/hello.xml , and I move it to http://example.org/dir2/hello.xml , do I still have the same XML document, or is this a different document?

A: This question isn't answered by the ratified specs.

For the purpose of this discussion, let's refine the terms so that a disconnected XML document has no intrinsic base URI, whereas a connected XML document does. So <helloWorld/> is the same disconnected document regardless of what base URI you use to find it, but it's (the text of) two different connected XML documents, one whose base URI is http://example.org/dir1/hello.xml, and one whose base URI is http://example.org/dir2/hello.xml.

In the current [@@november 1999?] infoset spec, the base URI is a property of a document info item, and hence a property of a document, and hence different base URIs imply different documents; i.e. by document it means connected XML document. We could have followed the disconnected XML document usage, but that would have made the specification of external entities and stuff more rhetorically complex [@@explain/justify].
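One way to picture the distinction is as two record types, sketched here in Python. The class names are just the terms from this discussion, not anything defined by the infoset spec, and the two base URIs are assumed for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DisconnectedDocument:
    text: str  # identity depends on the text alone


@dataclass(frozen=True)
class ConnectedDocument:
    text: str
    base_uri: str  # the base URI is part of the document's identity


text = "<helloWorld/>"
# Same text, found via two different (assumed) base URIs:
a = ConnectedDocument(text, "http://example.org/dir1/hello.xml")
b = ConnectedDocument(text, "http://example.org/dir2/hello.xml")

print(DisconnectedDocument(a.text) == DisconnectedDocument(b.text))  # True
print(a == b)  # False
```

In this picture the infoset's document info item corresponds to ConnectedDocument: the base URI is a field, so changing it changes the value.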

Q: Does every XML document have a base URI? What about documents from stdin, or documents in memory?

A: choose from:

Sometimes there's no base URI, or a null base URI. Then what are we to make of <!DOCTYPE greeting SYSTEM "hello.dtd">? Let's denote the union of null with the set of URIs as *URI.
The base URI is sometimes unspecified (aka application-defined). This simplifies the specs, and it's effectively the same as "sometimes there isn't one" [@@explain/justify].
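To see what the first option costs, here is a sketch in Python that models *URI as Optional[str], with None standing for the null base URI. The function name resolve_system_identifier is hypothetical, and returning None for an unresolvable identifier is just one possible policy, not something any of the specs prescribe.

```python
import re
from typing import Optional
from urllib.parse import urljoin

# Same rough "starts with a scheme" test as before; an assumption, not a validator.
SCHEME = re.compile(r"^[A-Za-z][A-Za-z0-9+.-]*:")


def resolve_system_identifier(base: Optional[str], system_id: str) -> Optional[str]:
    """Resolve a system identifier against a base drawn from *URI.

    Returns None when the identifier is relative and there is no base.
    """
    if SCHEME.match(system_id):
        # Already absolute: no base needed.
        return system_id
    if base is None:
        # Document from stdin or memory: nothing to combine with.
        return None
    return urljoin(base, system_id)


print(resolve_system_identifier("http://example.org/dir1/hello.xml", "hello.dtd"))
print(resolve_system_identifier(None, "hello.dtd"))  # None
```

Under the second option, the None branch would instead hand the decision to the application, which is why the two options end up looking much the same in practice.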