What Does a URI Identify?

TAG Draft 15 Mar 2002

This version:
http://www.w3.org/2001/tag/doc/identify (HTML, XML)
Norman Walsh, Sun Microsystems, Inc. <Norman.Walsh@Sun.COM>
Stuart Williams, Hewlett Packard, Inc. <skw@hplb.hpl.hp.com>


A one-page answer to the question "what does a URI identify?" @@say more

Status of this Document

This document has been developed for discussion by the W3C Technical Architecture Group.

This document is the work of the editors. It is a draft with no official standing. It does not necessarily represent the consensus opinion of the TAG and it may not even represent the consensus opinion of the editors.

Comments may be directed to the W3C TAG mailing list www-tag@w3.org (archive).

Publication of this document by W3C indicates no endorsement by W3C or the W3C Team, or any W3C Members.

Table of Contents

1 Introduction
2 What Does a URI Identify?
    2.1 What about URLs?
    2.2 Resources
    2.3 Properties of Resources
3 Syntax of URIs
    3.1 What about Fragment Identifiers?
4 Semantics of URIs
5 References
    5.1 Normative References
    5.2 Non-Normative References

1 Introduction

In order for two or more parties to communicate meaningfully, they must have some shared frames of reference. They must speak a common language, for example, and they must use a shared vocabulary. They must also have some means to identify the things they are discussing.

In the physical world, humans in close proximity can identify things with informal, relative identifiers: "that yellow car" or "the oval chair". Communicating over greater distances, we can rely on shared experiences: "the Monty Python 'parrot sketch'" or "the second edition XML specification". This works in part because we have a lot of shared experiences and in part because there are such great differences between objects in the physical world; it would be a rare person indeed who would have difficulty distinguishing between an oval chair and the second edition XML specification.

But in web space, two factors come into play that make such informal identifiers impractical. First, most of the objects that humans want to talk about are documents on the web, which are much less physically distinct than chairs and cars. Second, the participants involved in the communication are not always human beings; sometimes they are software agents of one form or another, and they simply don't cope with informality that well.

In web space, we need more precise identifiers, and URIs (Uniform Resource Identifiers) satisfy that requirement.

2 What Does a URI Identify?

On the web, URIs identify resources. "Any information that can be named can be a resource" [RFC2396]. In fact, this relationship can be taken as axiomatic: if a resource has a URI, it is identifiable on the web. If it does not, it is not.

Editorial note  
I don't think this is entirely the case if we want to cover things that may be represented as blank nodes in RDF. These are resources that may be described by relationships to other resources, but that may be left unnamed. (skw)

For a large class of resources (web pages, email addresses, etc.), this relationship is fairly obvious. What's less obvious is that this relationship applies to more abstract resources as well. The way to talk about a person, or real estate, or love in a form that network agents can process is to give it a URI.

The broader question of how your URI for love and my URI for happiness are related is a different and orthogonal question.

2.1 What about URLs?

URLs (uniform resource locators), like URNs (uniform resource names), are a subset of the general class of uniform resource identifiers. The distinction between URI, URL, URN, and the rest of the sometimes described URx identifiers is often more confusing than useful. We'll confine ourselves to the term "URI", meaning all of them.

2.2 Resources

In general, a resource is a time-varying conceptual mapping to a set of entities or values which are equivalent [Fielding]. For example, the resource identified by the URI http://www.w3.org is the concept of the W3C home page, which is itself a mapping to a set of values returned as hypertext. The value returned (the hypertext) changes frequently as new information displaces old information from the home page. The set of values mapped by a resource are equivalent resource representations and/or resource identifiers (giving further indirection or redirection). Dereferencing a resource identifier yields a representation of the current value of the referenced resource. At some time, t, the set of values that a resource maps to may be empty, which allows a concept to be identified before a realisation of the concept exists (or indeed after it has been retired).

For example, for the W3C technical report collection it is common practice to assign a URI which always references the current version of a given technical report. In addition, each published version of a technical report is assigned a distinct URI that references that specific version. These two URIs reference two different resources or two different concepts: a specific version of a technical report and the current version of that same report. At some point in time, dereferencing either URI may yield the same resource representation. However, at some later instant, dereferencing the URI that references the current version of the technical report may yield a different set of representations than dereferencing the version-specific URI. With the version-specific URI there is a commitment, as a matter of W3C policy, that the set of resource representations referenced by that URI will not change over time, whereas with the URI that references the current version of the report there is a commitment, as a matter of W3C policy, that the resource representations referenced by that URI will always represent the most up-to-date published version of the report.

The important point to note is that in general a resource is a time-varying mapping to a value, and not simply the value returned by dereferencing the resource at a particular moment in time.
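The distinction between a resource and the value it yields at one instant can be sketched in a few lines of Python. This is only an illustrative model, not part of any specification: the `Resource` class, its `publish` and `dereference` methods, the integer timestamps, and the report texts are all assumptions made for the example.

```python
from bisect import bisect_right
from dataclasses import dataclass, field


@dataclass
class Resource:
    """A resource modelled as a mapping from time to a set of representations."""

    # Parallel, time-ordered lists: from times[i] onward the value set is values[i].
    times: list = field(default_factory=list)
    values: list = field(default_factory=list)

    def publish(self, t, representations):
        """Record that from time t onward the resource maps to these values."""
        # Assumes publish() is called in increasing order of t.
        self.times.append(t)
        self.values.append(frozenset(representations))

    def dereference(self, t):
        """Return the set of values the resource maps to at time t."""
        i = bisect_right(self.times, t)
        # Before the first publication the mapping is empty.
        return self.values[i - 1] if i else frozenset()


# Hypothetical technical report with two published versions.
latest = Resource()   # "the current version of the report"
dated = Resource()    # "the version published at t=1"

latest.publish(1, {"draft-1 text"})
dated.publish(1, {"draft-1 text"})
latest.publish(5, {"draft-2 text"})   # only the "latest" resource moves on

same_at_t2 = latest.dereference(2) == dated.dereference(2)   # True
same_at_t6 = latest.dereference(6) == dated.dereference(6)   # False
empty_before = latest.dereference(0)                         # empty set
```

The two resources agree at t=2 and differ at t=6, yet neither has changed its identity; only the values they map to have changed.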

A further point to note about resources is that their identifiers expose nothing more than their identity (see Opaque below). It is a matter for the authority that assigns an identifier to a resource to say what that resource means and what commitment it makes to sustaining the meaning of that resource.

[RFC2396] is clear that "A resource can be anything that has identity". This is not a closed definition. Are there more things that can be regarded as resources than just those with assigned URIs (or URI references)? RDF provides the ability to describe resources by their relationship to one another, which leads to the notion of existentially quantified resources. For example, there exists a person whose internet mailbox is identified by the URI mailto:timbl@w3.org. This identifies the person of Tim Berners-Lee by reference to the URI of his internet mailbox without it being necessary to assign a URI to identify the concept of the person Tim Berners-Lee. Of course, dereferencing such a resource might prove to be an interesting challenge.

2.3 Properties of Resources

It is appealing to ask when two resources are the same resource, or perhaps more particularly when two resource identifiers identify the same resource. These are difficult questions. Two different URIs may identify the same resource, but it is only the authorities that assign those URIs that can make the commitment to them identifying the same resource.

However, as they say "Cool URIs don't change" [Cool] [Axioms]. There are strong social expectations that once a URI is assigned to identify a particular resource, then it should continue indefinitely to refer to that same resource or concept. This is of course best practice, but it is a matter of policy and commitment on the part of authorities assigning URIs rather than a constraint imposed by technological means.

We are dealing here with two time-dependent mappings: first, a time-dependent mapping between an identifier and a resource, and then a time-dependent mapping, as described above, between a resource and a set of equivalent values. The binding between an identifier and a resource is not fixed for all time and neither is the mapping between a resource and its current value. At some instant in time two URIs may reference the same resource, and at some later instant they may reference different resources. It is the authority controlling the assignment of a given URI to a particular resource that determines what resource that URI references at a given time. Some URI schemes may give stronger guarantees about the temporal stability of the URI to resource mapping. Some organisations as a matter of policy may give stronger guarantees than those intrinsic to the schemes that they use for assigning URIs.

The assignment of meaning and resources to an identifier comes under the control of some authority (much as a parent assigns a name to a child, although giving it meaning is perhaps a little harder). We can only know that two identifiers reference the same resource because the authorities that assign the identifiers assert (directly or indirectly) that they identify the same resource, and even then, such assertions may not hold for all time.

3 Syntax of URIs

In high-level terms, a URI consists of a scheme (http:, urn:, isbn:, ...) followed by a scheme-specific string, and an optional fragment identifier. Some schemes are hierarchical, allowing for both relative and absolute URIs, and some are not, allowing only absolute URIs.
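As a rough illustration of this three-part view, the following Python sketch splits a URI reference into scheme, scheme-specific string, and fragment identifier. It relies on `urlsplit` from the standard library only to detect the scheme; the `components` helper and the `#intro` fragment are assumptions introduced for the example.

```python
from urllib.parse import urlsplit


def components(uri_reference):
    """Split a URI reference into (scheme, scheme-specific string, fragment)."""
    parts = urlsplit(uri_reference)
    specific = uri_reference
    if parts.scheme:
        # Skip past "scheme:" to reach the scheme-specific string.
        specific = uri_reference[len(parts.scheme) + 1:]
    # The fragment identifier, if any, is not part of the scheme-specific string.
    specific = specific.split("#", 1)[0]
    return parts.scheme, specific, parts.fragment or None


absolute = components("http://www.w3.org/2001/tag/doc/identify#intro")
# ("http", "//www.w3.org/2001/tag/doc/identify", "intro")

mailbox = components("mailto:timbl@w3.org")
# ("mailto", "timbl@w3.org", None)

relative = components("../identify")
# ("", "../identify", None): a relative reference carries no scheme of its own
```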

Although knowledge of the scheme provides some information about the components of the scheme-specific string (for example, that absolute URIs in the http: scheme begin with a DNS name), it is generally inappropriate to make assertions about the content of the resource identified by a URI from the content of the URI itself. In particular, it is an error to assume that a URI that happens to end with the string ".html" identifies an HTML document.

3.1 What about Fragment Identifiers?

If a URI contains a sharp character (a "#"), the string that follows the "#" is a fragment identifier. Fragment identifiers are a mechanism for identifying part of a resource. For example, in a URI that retrieves an HTML document (a resource representation), the fragment identifier #foo can be used to reference the element with the ID "foo" within that document.

Editorial note  
There seems to have been some controversy over the terms URI and URI Reference. The terminology of [RFC2396] is to define a URI Reference as an optional URI (absolute or relative) followed optionally by a # and a fragment identifier. I feel that in our architectural writings we need to make a consistent choice over the use of the terms URI and URI reference. URIs seem to cover only a subset of URI references, and URIs with frag-ids seem to reference parts of a resource representation, modulo the mapping between qnames and URI references and RDF's use of # characters in namespace names and of URIs with frag-ids to name graph nodes that RDF also calls resources. (skw)

This means that in general, it's not possible to determine what a fragment identifier means without retrieving the resource into which it points.

The fragment identifier identifies some sub-part of a resource representation. The syntax and interpretation of a fragment identifier are determined by the MIME media type of a resource representation. This is considered a design flaw [Fragments].
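The dependence of fragment interpretation on media type can be sketched as follows. This is a minimal illustration under stated assumptions: the `locate_fragment` function, the `_IdFinder` helper, the example.org URI, and the sample page are all hypothetical, and only the text/html rule (match an element by its ID) is modelled.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag


class _IdFinder(HTMLParser):
    """Record the tag name of the first element whose id matches."""

    def __init__(self, wanted):
        super().__init__()
        self.wanted = wanted
        self.found = None

    def handle_starttag(self, tag, attrs):
        if self.found is None and dict(attrs).get("id") == self.wanted:
            self.found = tag


def locate_fragment(uri_reference, media_type, body):
    """Interpret the fragment of uri_reference against one representation."""
    _, fragment = urldefrag(uri_reference)
    if not fragment:
        return None  # no fragment: the reference is to the whole resource
    if media_type == "text/html":
        finder = _IdFinder(fragment)
        finder.feed(body)
        return finder.found
    # Other media types define their own fragment semantics (not modelled here).
    raise NotImplementedError(media_type)


page = '<html><body><h1 id="top">Title</h1><p id="foo">Text</p></body></html>'
element = locate_fragment("http://example.org/doc#foo", "text/html", page)
# element is "p": the fragment could only be resolved with the representation in hand
```

Note that the function cannot answer at all until it has been given both a representation and that representation's media type.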

A URI that consists of only a fragment identifier (i.e., one that begins with a "#") always points into the document that contains the URI, irrespective of the effective base URI.

4 Semantics of URIs

URIs have a small number of semantic properties independent of the resources that they identify. The URIs of a particular scheme may have additional semantics; that is a question for the specification that defines each scheme.

URIs are Uniform

In any context that allows a URI, any URI may be used. It is an error to say that only URIs of a specific scheme are allowed in a certain context.

Uniformity "...allows different types of resource identifiers to be used in the same context, even when the mechanisms used to access those resources may differ; it allows uniform semantic interpretation of common syntactic conventions across different types of resource identifiers; it allows introduction of new types of resource identifiers without interfering with the way that existing identifiers are used; and, it allows the identifiers to be reused in many different contexts, thus permitting new applications or protocols to leverage a pre-existing, large, and widely-used set of resource identifiers." [RFC2396]

URIs are Universal

URIs may be used to identify any identifiable thing, anywhere. Also, any resource of significance should be assigned a URI. [Axioms]

An absolute URI always means the same thing, regardless of the context in which it occurs. It is an error to assert that you can construct a context in which absolute URIs have different meaning than they have outside that context. (The same holds for relative URIs, except that context may change the effective base URI.)
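The effect of context on relative references, and its lack of effect on absolute URIs, can be seen with `urljoin` from the Python standard library. The two base URIs below are hypothetical and chosen only for illustration.

```python
from urllib.parse import urljoin

absolute = "http://www.w3.org/2001/tag/"
relative = "doc/identify"

# Two hypothetical contexts with different effective base URIs.
base_a = "http://example.org/a/index.html"
base_b = "http://example.com/b/c/page.html"

# An absolute URI resolves to itself whatever the base.
abs_in_a = urljoin(base_a, absolute)   # "http://www.w3.org/2001/tag/"
abs_in_b = urljoin(base_b, absolute)   # "http://www.w3.org/2001/tag/"

# A relative reference yields a different absolute URI under each base.
rel_in_a = urljoin(base_a, relative)   # "http://example.org/a/doc/identify"
rel_in_b = urljoin(base_b, relative)   # "http://example.com/b/c/doc/identify"
```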

Editorial note  
I'm not sure that Universal is intended to imply some stability of meaning so much as that URIs may be applied to anything that is identifiable. (skw)

URIs are Opaque

It is an error to assert properties about the content of a resource based solely on the content of the URI used to identify it [Axioms]. For example, it is an error to infer anything about the nature of a resource or its available representations from the presence of a trailing ".html", ".asp" or ".png" or from the presence of strings like "servlet" or "cgi-bin" within a URI.
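In practice, an agent that respects opacity takes the nature of a representation from what the server declares about it, not from the spelling of the URI. The following is a minimal sketch of that idea: the `declared_media_type` helper, the plain-dictionary view of response headers (with the key spelled exactly "Content-Type"), and the example URI are assumptions made for illustration.

```python
def declared_media_type(headers):
    """Return the media type a response declares, or None if it declares none."""
    value = headers.get("Content-Type")
    if value is None:
        # The server made no statement; the URI's spelling does not fill the gap.
        return None
    # Drop parameters such as "; charset=utf-8" and normalise case.
    return value.split(";", 1)[0].strip().lower()


# Hypothetical response for http://example.org/report.html: despite the
# ".html" suffix, the server states that the representation is a PNG image.
uri = "http://example.org/report.html"
response_headers = {"Content-Type": "image/png"}

actual = declared_media_type(response_headers)   # "image/png"
```

The `uri` variable is deliberately never consulted when deciding the media type.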

URIs are Consistent

The resource identified by a particular URI should always be "the same", when it is identified by that URI. This does not mean that the stream of bits associated with a URI (if, in fact, there is one) can never change. The notion of "sameness" cannot be absolutely identified.

For some URIs, an unchanging stream of bits is entirely appropriate. For others, such as the URI for today's weather or the current time of day, the stream of bits is expected to vary in perfectly understandable ways even though the resource identified remains the same.

URIs are Not Unique

Although the resource identified by a URI should be consistent, it does not follow that different URIs must always refer to different resources. It is perfectly reasonable for a resource to be identified by several different URIs.

URIs are Transcribable

The syntax of a URI reference is defined in terms of a sequence of characters. Although these characters are drawn from those expressible using the US-ASCII character set, their primary form is as a sequence of characters and not as a sequence of octets under some character set encoding exchanged between computers or stored in computer files. This makes URI references transcribable in such a way that they can be passed around as part of social communication every bit as easily as they can be exchanged by technological means. They have intruded into our daily lives. They appear in television advertising; they are spoken on the radio or over the telephone; they are printed in newspapers and books, jotted on pieces of paper, and sent in letters. Oh yes, and they appear occasionally on Web sites.

5 References

5.1 Normative References

[RFC2396] IETF "RFC 2396: Uniform Resource Identifiers (URI): Generic Syntax", T. Berners-Lee, R. Fielding, L. Masinter, August 1998. (See http://www.ietf.org/rfc/rfc2396.txt.)

5.2 Non-Normative References

[Axioms] "Universal Resource Identifiers - Axioms of Web Architecture", T. Berners-Lee, living document dated December 1996. (See http://www.w3.org/DesignIssues/Axioms.html.)
[Fragments] "Fragment Identifiers on URIs", T. Berners-Lee, living document dated April 1997. (See http://www.w3.org/DesignIssues/Fragment.html.)
[Fielding] "Principled Design of the Modern Web Architecture", R.T. Fielding and R.N. Taylor, UC Irvine. (See http://www.cs.virginia.edu/~cs650/assignments/papers/p407-fielding.pdf.)
[Cool] "Cool URIs don't change", T. Berners-Lee, W3C, 1998. (See http://www.w3.org/Provider/Style/URI.html.)