On URIs, resources, and documents

Dan Connolly, Jul 2002
DRAFT $Revision: 1.30 $ of $Date: 2002/07/22 22:54:02 $ by $Author: connolly $

cf tag issue 14

@@hmm... not sure if this is really a FAQ or more of a socratic dialog...

Part I: Background Terminology

What's the relationship between URIs, resources, and documents?

That's a tricky question; let's start by getting clear what you mean by each of those in isolation, before we talk about their relationship.

By URI, I bet you mean the sort of thing you put in your HTML files, right? Note that this is subtly different from the meaning of the term URI in RFC2396, the most widely ratified specification of the term.

What is a URI?

It's a string of characters starting with a scheme name and a colon (scheme:), e.g.

http://www.w3.org/
ftp://ftp.w3.org/
irc://irc.openprojects.net/rdfig
urn:oasis:names:tc:SAML:1.0:assertion
tel:+1-913-555-1212

See RFC2396 for details.
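To make the syntax a bit more concrete, here is a minimal sketch that pulls the scheme off each of the examples above. It uses Python's standard urllib.parse module; the choice of Python and the helper name scheme_of are ours, for illustration only, and nothing in RFC2396 depends on them.

```python
from urllib.parse import urlsplit

# The example URIs listed above.
EXAMPLES = [
    "http://www.w3.org/",
    "ftp://ftp.w3.org/",
    "irc://irc.openprojects.net/rdfig",
    "urn:oasis:names:tc:SAML:1.0:assertion",
    "tel:+1-913-555-1212",
]


def scheme_of(uri):
    """Return the scheme, i.e. the part before the first ':' (lower-cased)."""
    return urlsplit(uri).scheme


if __name__ == "__main__":
    for uri in EXAMPLES:
        print(scheme_of(uri), uri)
```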

I didn't know there were URIs for IRC channels and telephones; what other URI schemes are there?

acap, cid, data, etc. IANA keeps a list of registered URI schemes.

Can I make up a URI scheme? is `myscheme:blort` a URI?

Well, myscheme:blort meets the syntactic constraints of RFC2396, so yes, it's a URI. But myscheme isn't registered, so you don't have license to use that URI in any Internet protocols; there aren't any valid uses of it. You can't expect anybody to know what you mean by it, and you aren't guaranteed that somebody else isn't already using it for something else. But we're getting ahead of ourselves, into the relationship of URIs to resources...
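The difference between "syntactically a URI" and "uses a registered scheme" can be sketched as two separate checks. The regular expression below follows the scheme grammar in RFC2396 (a letter, then letters, digits, "+", "-" or "."); it checks only the scheme part, not the rest of the URI. The set SOME_REGISTERED_SCHEMES is a small illustrative subset we made up for this sketch, not the IANA list itself.

```python
import re

# Scheme syntax per RFC2396: a letter followed by letters, digits, "+", "-", ".".
SCHEME_RE = re.compile(r"^([A-Za-z][A-Za-z0-9+.-]*):")

# Illustrative subset only (an assumption); IANA keeps the authoritative list.
SOME_REGISTERED_SCHEMES = {"http", "ftp", "mailto", "urn", "tel", "acap", "cid", "data"}


def scheme(uri):
    """Return the lower-cased scheme, or None if the string has no scheme."""
    match = SCHEME_RE.match(uri)
    return match.group(1).lower() if match else None


def is_registered(uri, registry=SOME_REGISTERED_SCHEMES):
    """True if the string has a scheme and that scheme is in the registry."""
    name = scheme(uri)
    return name is not None and name in registry


if __name__ == "__main__":
    for candidate in ["myscheme:blort", "http://www.w3.org/"]:
        print(candidate, scheme(candidate), is_registered(candidate))
```

So myscheme:blort passes the first check and fails the second, which is the situation described above.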

If you want to register a new scheme, follow the IETF process, especially the guidelines in RFCNNNN@@.

Is `../foo` a URI?

No. It is a URI reference, though. And in the context of a base URI (i.e. the address where you got the document in which you found ../foo; say http://aDomain/aPath/xyz) it's well-defined what URI ../foo corresponds to: http://aDomain/foo. See section @@ of RFC2396 for details.
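The resolution described above can be tried out with urljoin from Python's standard urllib.parse module. This is a sketch, not a reference implementation: urljoin only applies the algorithm to schemes on its own internal list, and the wrapper name absolutize is ours.

```python
from urllib.parse import urljoin


def absolutize(base_uri, reference):
    """Resolve a URI reference against the base URI of the enclosing document."""
    return urljoin(base_uri, reference)


if __name__ == "__main__":
    print(absolutize("http://aDomain/aPath/xyz", "../foo"))
```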

What's the absolute form of `foo/bar` w.r.t. `mailto:somebody@somedomain`?

Now you're getting pedantic. I said it was well-defined; I didn't say it was intuitive. How often do you get documents from that base URI? That's not really a frequently asked question.

The answer is mailto:foo/bar, by @@recent interpretation of RFC2396. Implementation experience is not completely consistent.

[@@weasel words about how it's useful to eventually get this nailed down, i.e. developing a URI test suite, but the bugs aren't costing us much in the meantime.]

Does the relative-to-absolute algorithm depend on schemes?

No.

@@more?

Is `http://example/aPath/myDoc.html#section2` a URI?

No, per RFC2396, URIs don't include # characters. URI references do.
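A quick way to see the split between the URI and the fragment identifier is urldefrag, again from Python's urllib.parse. The function name split_reference is ours; this is only a sketch of the distinction drawn above.

```python
from urllib.parse import urldefrag


def split_reference(uri_reference):
    """Split a URI reference into (URI without fragment, fragment identifier)."""
    uri, fragment = urldefrag(uri_reference)
    return uri, fragment


if __name__ == "__main__":
    print(split_reference("http://example/aPath/myDoc.html#section2"))
```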

What is a Resource?

Just about anything.

From RFC2396:

A resource can be anything that has identity. Familiar examples include an electronic document, an image, a service (e.g., "today's weather report for Los Angeles"), and a collection of other resources. ...

Are things like people and library books Resources?

Yes. RFC2396 continues...

... Not all resources are network "retrievable"; e.g., human beings, corporations, and bound books in a library can also be considered resources.

What is a document?

Unfortunately, the term is pretty ambiguous in this context.

We often think of files as documents. You can save the file several times, and you still think of it as the same document. When the U.S. Constitution is amended, we think of it as the same document.

For the purpose of this discussion, let's please reserve the term document to mean a sequence of bytes paired with an Internet Media Type [What's an internet media type? Well, it's much like a URI scheme... never mind; let's not go into that just now, OK?]. i.e. the contents of a myDoc.html file on your disk at any one time is a document, but if you revise it and save it, you've got a new document. If you want to talk about the mutable file, let's use resource for that.
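If it helps, here is the distinction as a small data-structure sketch: a document is an immutable pair of media type and bytes, while the file that gets revised is a separate, mutable thing. The class names Document and FileResource are hypothetical, invented for this illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Document:
    """A document in the sense used here: a byte sequence plus a media type."""
    media_type: str
    body: bytes


class FileResource:
    """Hypothetical stand-in for the mutable file: its document changes over time."""

    def __init__(self, media_type, body):
        self._current = Document(media_type, body)

    def snapshot(self):
        """Return the document that represents the file right now."""
        return self._current

    def save(self, new_body):
        """Revising the file yields a new document; the old one is untouched."""
        self._current = Document(self._current.media_type, new_body)


if __name__ == "__main__":
    my_doc = FileResource("text/html", b"<p>first draft</p>")
    before = my_doc.snapshot()
    my_doc.save(b"<p>second draft</p>")
    print(before == my_doc.snapshot())
```

Two documents with the same media type and the same bytes compare equal here; whether that is the right notion of identity is exactly the sort of thing the base URI remark below complicates.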

This use of document corresponds to the term entity body in Hypertext Transfer Protocol -- HTTP/1.1, RFC2616, and in the MIME specs, e.g. RFC2045.

[more motivation for using document this way: XML specs. Note that per Infoset REC, the base URI of an XML document is intrinsic to that document; so if you take your file and copy it somewhere else in the Web, you get a new document.]

Perhaps we need another term for resources that are often called documents...

What's a work?

Each work is a resource which is closely related to one or more documents; e.g. an essay, a book, a computer program, the U.S. Constitution, etc.

"An abstract information thing of value, typically intellectual property" -- timbl's doc schema.

see also: Conceptual Works in the OpenCyc ontology.

Ok; now that we have our background terms straight...

Part II: Relating the Terms

So what's the relationship between URIs and resources?

First, each valid use of a URI reference unambiguously identifies one resource. That is: if you are using a registered URI scheme and following all the other relevant protocol specifications, it is unambiguous what resource you are referring to. This goes for all URI references, not just URIs.

Is use of unregistered URI schemes in the public Internet valid?

No; as we mentioned above, if you use an unregistered URI scheme, you don't have any guarantees that somebody else isn't using the same URI to mean something else in the same protocol message.

[@@elaborate with more of an example?]

In an HTTP transaction, what's the relationship between the request URI and the document in the 200 (OK) response?

This very typical example is worth some elaboration...

You ask your browser (or other user agent) to visit http://example.org/aPath/myDoc ; that's called the request URI, in the HTTP specification. Your user agent looks up example.org in DNS and gets an IP address back; makes a TCP connection, and sends a request. If all goes well, you get an HTTP 200 (OK) response back, containing a document: the media type is indicated by the Content-Type header field, and the byte sequence is in the body of the HTTP response.

Provided the HTTP transaction is a valid use of the request URI, it's clear what resource it identifies. This transaction represents this resource as the document in the response. To make (valid) use of a URI in order to represent the resource it identifies by a document is to dereference the URI.
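The transaction just described can be acted out on one machine. The sketch below starts a toy server on the loopback interface and dereferences /aPath/myDoc against it, returning the status, the media type from Content-Type, and the bytes of the body. Everything here is an assumption made for illustration: the handler, the HTML it serves, and the use of 127.0.0.1 in place of example.org (so the DNS step is skipped).

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class MyDocHandler(BaseHTTPRequestHandler):
    """Toy origin server: every GET is answered 200 (OK) with one HTML document."""

    def do_GET(self):
        body = b"<html><title>myDoc</title></html>"
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the demo quiet


def dereference(host, port, path):
    """Send a GET for path; return (status, media type, body bytes)."""
    conn = http.client.HTTPConnection(host, port, timeout=5)
    try:
        conn.request("GET", path)
        response = conn.getresponse()
        return response.status, response.getheader("Content-Type"), response.read()
    finally:
        conn.close()


def demo():
    """Run one transaction against the toy server and return its result."""
    server = HTTPServer(("127.0.0.1", 0), MyDocHandler)  # port 0: any free port
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    try:
        return dereference("127.0.0.1", server.server_address[1], "/aPath/myDoc")
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    print(demo())
```

The pair of media type and bytes that comes back is the document; the thing the server keeps answering for is the resource.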

@@idea: We might try introducing formal notation at this point...

@prefix : <#>.
@prefix mediaTypeText: <...@@somewhere in IANA land...>.
@prefix HTTP: <...@@some specification of HTTP...>.
:req23 a HTTP:Transaction.
<http://example.org/aPath/myDoc> :req23 (mediaTypeText:html "<html ... ").

end idea:.

How about HTTP 304 not modified responses?

This merits elaboration as well...

Suppose your user agent dereferenced http://example.org/aPath/myDoc and got a reply, dated 15:41, including some document. You browse around a bit this way, and not much later you follow another link to that same address. If your user agent is clever (and quality HTTP user agents are...) it will include an If-Modified-Since: ... 15:41 header field in this second request for myDoc, since it already has a reply in cache. Suppose the server knows that the resource hasn't changed (using local filesystem metadata); then the server will reply 304 Not Modified. This second transaction represents the resource identified by http://example.org/aPath/myDoc by the same document as the first.
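The server's side of that exchange boils down to one comparison. Here is a minimal sketch of the decision, assuming the server knows the resource's last-modified time; the function name respond is ours, and a real server weighs other header fields as well.

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime


def respond(last_modified, if_modified_since=None):
    """Pick a status code for a GET.

    last_modified: timezone-aware datetime of the resource's last change.
    if_modified_since: value of the If-Modified-Since header field, or None.
    """
    if if_modified_since is not None:
        since = parsedate_to_datetime(if_modified_since)
        if last_modified <= since:
            return 304  # Not Modified: the cached document still represents it
    return 200  # OK: send a document in the response body


if __name__ == "__main__":
    first_reply = datetime(2002, 7, 22, 15, 41, tzinfo=timezone.utc)
    header = format_datetime(first_reply, usegmt=True)
    print(respond(first_reply))          # first visit, nothing in cache
    print(respond(first_reply, header))  # second visit, resource unchanged
```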

If you like, you can look at the two transactions as one use of that identifier.

How about in FTP? What's the relationship between an ftp://... URI and a document I get via ftp?

Likewise, most FTP RETR transactions are valid uses of URIs that represent resources with documents.

[@@LIST almost represents a resource as a document; often, a proxy makes up an HTML document that represents the directory resource, and we trust that this is valid.]

Are resources just like files?

Sometimes.

In many cases, the relationship between URIs, resources, and documents corresponds to the relationship between filenames, files, and file contents: each filename identifies one file. One file might be known by several filenames (think of shortcuts/symlinks/aliases). The file can have different contents at different times, but if you make a copy of the file's contents, you'll get just one thing at any time.

FTP and HTTP are designed to exploit this analogy, to make it easy to export filesystems into the Web.

But take care not to overgeneralize.

Is the relationship between resources and files 1-1?

No; for example there are at least 2 files, one called w3c_home.gif and one called w3c_home.png, used in dereferencing http://www.w3.org/Icons/w3c_home. Format negotiation is a technique that allows for graceful evolution of data formats (aka Internet media types) in the web. HTTP has specific support for it; see section @@ of the HTTP specification for details.
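To give a feel for how one resource can sit in front of two files, here is a simplified sketch of server-driven format negotiation. The VARIANTS table and both function names are hypothetical; the Accept parsing handles only q values and the two wildcard forms, which is much less than the HTTP specification requires. Ties go to whichever variant is listed first.

```python
# Hypothetical variant table for one resource; file names are from the text above.
VARIANTS = {"image/png": "w3c_home.png", "image/gif": "w3c_home.gif"}


def parse_accept(header):
    """Parse an Accept header into {media range: q}, ignoring other parameters."""
    prefs = {}
    for item in header.split(","):
        fields = [field.strip() for field in item.split(";")]
        if not fields[0]:
            continue
        q = 1.0
        for param in fields[1:]:
            if param.startswith("q="):
                q = float(param[2:])
        prefs[fields[0]] = q
    return prefs


def choose_variant(accept_header, variants=VARIANTS):
    """Return (media type, file name) of the preferred variant, or None."""
    prefs = parse_accept(accept_header)
    best, best_q = None, 0.0
    for media_type in variants:
        type_wildcard = media_type.split("/")[0] + "/*"
        q = prefs.get(media_type, prefs.get(type_wildcard, prefs.get("*/*", 0.0)))
        if q > best_q:
            best, best_q = media_type, q
    return (best, variants[best]) if best else None


if __name__ == "__main__":
    print(choose_variant("image/png, image/gif;q=0.5"))
    print(choose_variant("image/gif"))
```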

What about file: URIs... `file:/etc/hosts` is ambiguous, isn't it?

Each valid use of file:/etc/hosts is unambiguous. It's valid to use file: URIs within one machine. And it may be valid to use file: URIs to refer to well-known files such as /etc/hosts, if you're confident all the readers are using Unix systems.

But if you write

See <a href="file:/home/user12/niftyStuff.html">my new nifty stuff</a>.

in an HTML document and publish that document on the public Internet, and somebody using a different machine reads it, they won't be able to dereference it; that's not a valid use of that URI. Their user agent may not detect the failure; it may find a file under that pathname and display it; but since this use of that URI is invalid, the document they see is likely irrelevant to what you meant.

What about personalized content; is http://my.yahoo.com/ ambiguous?

No; in each use, it refers to "the Yahoo personalized content for the reader", whoever the reader is.

@@hmm... think more about this one; plusses and minuses...

What about relative URI references; `../myFile` is ambiguous, isn't it?

Again, each valid use of ../myFile is unambiguous; use of ../myFile with a base URI of http://example/dirA/this/stuff may refer to a different resource than use of ../myFile with a base URI of http://example/dirB/that/stuff; in the first case, its absolute form is http://example/dirA/myFile, but in the second case, its absolute form is http://example/dirB/myFile. Clearly these need not identify the same resource.
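The point that the base URI is part of each use can be checked mechanically. A short sketch with Python's urljoin, using the two base URIs from the text; the helper name absolute_forms is ours.

```python
from urllib.parse import urljoin

REFERENCE = "../myFile"
BASES = [
    "http://example/dirA/this/stuff",
    "http://example/dirB/that/stuff",
]


def absolute_forms(reference, bases):
    """Map each base URI to the absolute form of the same relative reference."""
    return {base: urljoin(base, reference) for base in bases}


if __name__ == "__main__":
    for base, absolute in absolute_forms(REFERENCE, BASES).items():
        print(base, "->", absolute)
```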

Are resources just like objects in C++ and Java?

Sometimes.

The relationship between URIs, resources, and documents is like the relationship between C pointers, memory cells, and values. The analogy with C++ or Java is even stronger: object references, objects, and values returned from method calls; in particular, the toString() or writeObject() method from Serializable.

But again, take care not to overgeneralize: while every Java object exports an equals() method, most resources do not.

[@@something about scale/scope: same object reference might point to different objects in different runs of a program; one run of a program is like one use of a URI; ideally, the whole web is one use, i.e. one run of a program.]

Does every resource have a URI?

Not necessarily. If you consider every real number a resource, then clearly we can't give every real number a URI without collisions: there are only denumerably many URIs, but uncountably many real numbers. (@@cite some explanation of Cantor's argument or whatever for elaboration)
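The "denumerably many" half of that argument can be made concrete: URIs are finite strings over a finite alphabet, and all such strings can be listed one after another, shortest first. The sketch below does this for a two-letter alphabet standing in for the URI character set (an assumption for brevity). The other half, that no such list can cover the reals, is not something code can show.

```python
from itertools import count, islice, product


def enumerate_strings(alphabet):
    """Yield every non-empty finite string over alphabet, shortest first."""
    for length in count(1):
        for chars in product(alphabet, repeat=length):
            yield "".join(chars)


if __name__ == "__main__":
    # Every string shows up at some finite position in this listing.
    print(list(islice(enumerate_strings("ab"), 6)))
```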

[@@see also: RDF bNodes stuff.]

Can the same resource have different URIs? Does `http://WWW.EXAMPLE/` identify the same resource as `http://www.example/`?

Yes, the HTTP specification [@@section] says that in any valid use of http://WWW.EXAMPLE/ and http://www.example/, they identify the same resource.

But don't count on consumers realizing that. Be consistent about how you write them if you expect consumers to realize you mean the same resource. [@@elaborate? talk about canonical forms?]
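One way to be consistent is to fold the case of the parts that are known to be case-insensitive before writing a URI down or comparing two of them. A minimal sketch, assuming there is no userinfo in the authority part (userinfo is case-sensitive and this function would wrongly lower-case it); the function name is ours, and the path is deliberately left alone.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_host_case(uri):
    """Lower-case the scheme and host; leave path, query and fragment as written."""
    parts = urlsplit(uri)
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), parts.path, parts.query, parts.fragment)
    )


if __name__ == "__main__":
    print(normalize_host_case("http://WWW.EXAMPLE/"))
    print(normalize_host_case("http://example/BIGWORD"))
```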

Does `http://example/BIGWORD` identify the same resource as `http://example/bigword`?

Not necessarily.

@@more

@@usage note: don't be silly enough to depend on case as the only distinction between your URIs. It's too fragile.

Can the same URI be used to identify different resources?

While each use of a URI reference is unambiguous, different uses of the same absolute URI reference may identify different resources. These situations are often either obscure or costly or both.

Typically, if I browse the web and I visit http://zoo.example/animals/tigerBob and I read a document about Bob, a tiger at the zoo, I like to think that I can make a link to that address in a document I publish, and when readers follow links from my documents to that address, they'll get a document about that same Bob. This is where the Web derives much of its value: using short strings to share resources. I like to consider my visit to that address and my reader's visit to be in the same naming context. I don't mind if somebody revises tigerBob and my reader gets a slightly different document; but if they get a document about a tropical storm or something else totally irrelevant to the document I found when I was browsing, the Web hasn't been of much use.

Recall the discussion of HTTP 304 not modified transactions; the HTTP protocol shows that the two transactions are causally related; that the server knew about the first transaction when servicing the second. This is reasonably clear evidence [@@can we get rid of these weasel-words? Need to think about expires etc.] that the two transactions are one use of the URI http://example.org/aPath/myDoc. Though nothing in the HTTP header fields says so, my link to tigerBob was causally related to the response I got from zoo.example; the Referer header field in my reader's request makes it clear that his/her request is causally related to these events as well. It's easiest if we can just look at the whole web as one use of all the URIs. [@@getting sloppy now.]

What happens when domains are reassigned, sold, etc.?

If http://acme.example/pricelist was used to get a pricelist about ACME software's prices, then ACME software sold the domain to ACME pets, and it was then used to get a list of pet prices, then that same URI refers to different resources in those two uses.

The use of this address by ACME pets interferes with anyone who wants to continue to use it to refer to ACME software's prices: historians, court archivists, news archive services, etc.

This is clearly sub-optimal.

@@elaborate: W3C gets a new member; it's a new W3C, in some uses

@@elaborate: conneg across media types, where the fragments don't refer to anything compatible

Part @@: Questions with no organizational structure

Can I use an HTTP URI, with no fragment identifier, to name/refer to/identify my car?

Yes and No.

From http://www.ietf.org/rfc/rfc2616.txt (top of page 9): An HTTP URI denotes a "network data object or service [which] may be available in multiple representations (e.g. multiple languages, data formats, size, and resolutions) or vary in other ways." There's some weasel room in there, but not to include cars.

-- sandro in RDFIG. @@

TimBL's axiom about doc:work: { [] log:uri [string:startsWith "http:"] ; log:racine ?x} => { ?x a doc:Work}.

TimBL's conjecture: doc:Work ont:disjointFrom commonKnowledge:Car.

and...

I have two things with identity, so need two URIs. http://www.markbaker.ca/index.html identifies my HTML "web page"

-- MarkB

What happens when you set conneg up to serve text/html and application/rdf+xml from the same address? What does /myfavoritemovie#title identify then? An element, or a movie?

sbp 19 Jul 2002 20:56:49 +0100

How about a Robot?

@@did Roy give pointers?

XLink vs. RDF

pointing to elements vs. pointing to things.

Acks/Fodder/Related Work

Things versus their names, use/mention, map/territory. In the abstract/intro somewhere?

tel: URIs can have the same sort of validity failures as file: URIs... see Dierken 22Jul to www-tag. In 22Jun, RF says news: URIs do that too. hmm...

ISBN: copy of book vs. intellectual content of book

manifestation vs work

DesignIssues/Generic

use patents/copyrights as an example?

persistence guarantees: URN NIDs and domain names.

how many members in W3C? time.

www9 talk

why separate schemes for mid:stuff@domain and news:stuff@domain ? need new URI scheme for iCalendar items?

why do we use absolute naming in Web Architecture? surely relative naming is less constraining, no?