The use of Metadata in URIs

1 Introduction

This finding addresses two related questions:

What, if anything, can be inferred from a URI used to identify a resource?
What information about a resource can or should be embedded in a URI used to identify that resource?

The first question is focussed on people and software making use of URIs assigned outside of their own authority (observers). The second question is focussed on people and software acting in the role of or on behalf of a URI assignment authority (authorities) for URI assignments within the scope of that authority. The opaque nature of URIs is considered from the point of view of observers and authorities.

1.1 Summary of Principles

People and software making use of URIs assigned outside of their own authority (i.e. observers) MUST NOT attempt to infer properties of the referenced resource except as licensed by relevant normative specifications or by URI assignment policies published by the relevant URI assignment authority.
Relevant normative specifications include the URI specification [URI], registered URI scheme specifications (see IANA URI Scheme Registry) and other normative specifications that specify structured use of URIs or URI components.
For example, based solely on inspection of a URI it is unsound to infer:
1. that a retrieval operation on a URI that ends .html will return an HTML representation of the resource with a content-type of text/html.
2. that a resource identified by a URI whose authority component ends .ca is hosted in Canada or operated by a Canadian organisation.
A URI assignment authority MAY publish specifications detailing its URI assignment policies. Policies may detail the use of resource properties (version, creation date, author) in making URI assignments. Policies may detail the structure and semantics of URI query components served by resources under the control of the authority.
People and software using URIs assigned by that authority (i.e. observers) MAY make use of such published information. For example the structure of URIs used to identify W3C technical reports is documented in [PUBRULES]. However, observers are cautioned that assignment policies are not generally subjected to standardization and may be changed by the relevant authority at any time. Software programmed on the basis of such a policy is at risk of becoming obsolete.

Editorial note

I'm not sure how or whether we should account for specifications other than URI scheme specifications that might impose some structure on URIs. In the case of XML Schema Component Designators the structuring constraints only impact the fragment identifier. Likewise for identifying abstract WSDL components in the current WSDL draft, although one of the options being considered would be to apply structure within the URI path component. Also, there seems to me to be a difference between imposing structure on fragments (because their interpretation is media-type dependent) and on the path component.

Editorial note
Members of the TAG have suggested that the finding end at this point and that discussion in sections 2 and 3 be deleted. The editor would appreciate feedback on this suggestion.

Editorial note
Members of the TAG have suggested that this finding be reduced simply to a statement to the effect of "Don't peek inside URIs". The editor would appreciate feedback on this suggestion.

2 Deriving Information from URI Syntax and Structure

Editorial note
The current draft of RFC2396bis [URI] is used throughout as the referenced URI specification.

The generic syntax of URIs is specified in [URI]. At the top level a URI is divided into four syntactic components: scheme, hier-part, an optional query, and an optional fragment.
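The four-way division can be illustrated with a minimal sketch, assuming Python's standard urllib.parse module. The function name top_level_components and the example.org URI are hypothetical and used only for illustration.

```python
from urllib.parse import urlsplit


def top_level_components(uri: str) -> dict:
    """Split a URI into the four top-level components named in [URI]."""
    parts = urlsplit(uri)
    # urlsplit reports the authority (netloc) and the path separately;
    # taken together they correspond to the hier-part.
    hier_part = ("//" + parts.netloc if parts.netloc else "") + parts.path
    return {
        "scheme": parts.scheme,
        "hier-part": hier_part,
        # urlsplit does not distinguish an empty query or fragment from an
        # absent one; this sketch reports both cases as None.
        "query": parts.query or None,
        "fragment": parts.fragment or None,
    }


# Hypothetical URI, for illustration only.
components = top_level_components("http://example.org/reports/2003?lang=en#summary")
```

Splitting a URI in this way exposes only its syntax. Whether any meaning can be attached to the individual components is the subject of the following subsections.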

2.1 Scheme Component

The scheme component of a URI is the primary determinant of the syntax and semantics of the remainder of a URI. It determines the relevant URI scheme specification which should be registered in the IANA URI Scheme Registry. A URI scheme specification may constrain the sort of resource that identifiers assigned using that scheme reference. For example,

[RFC2368] states that "The mailto URL scheme is used to designate the Internet mailing address of an individual or service."
[RFC1738] states "The FTP URL scheme is used to designate files and directories on Internet hosts accessible using the FTP protocol (RFC959)."
[RFC2392] defines the intended use of the mid and cid URI schemes for referencing messages and message parts.

A URI scheme may also delegate the specification of such constraints to other specifications. For example, the URN specification [RFC2141] introduces the concept of URN Namespaces, and [RFC3406] then delegates the specification of URN assignment procedures, constraints on the type of resource being identified, and other syntactic and semantic constraints (if any) to the associated URN Namespace specification.

Thus, either directly or via delegation to other specifications, some URI schemes enable the type of resource being referenced to be determined, others do not.
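As a rough sketch of scheme-based inference, an observer might consult a table built from the scheme specifications quoted above and decline to infer anything for schemes that are not in it. The table SCHEME_CONSTRAINTS and the function resource_type_from_scheme are hypothetical names. The entries only paraphrase the statements cited in this section and are not a complete or normative registry.

```python
from urllib.parse import urlsplit

# Paraphrases of the scheme specifications cited in this section.
# Illustrative only; the specifications themselves are authoritative.
SCHEME_CONSTRAINTS = {
    "mailto": "Internet mailing address of an individual or service [RFC2368]",
    "ftp": "file or directory on an Internet host accessible using FTP [RFC1738]",
    "mid": "message or message part [RFC2392]",
    "cid": "message or message part [RFC2392]",
}


def resource_type_from_scheme(uri: str):
    """Return what the scheme specification licenses, or None if nothing."""
    scheme = urlsplit(uri).scheme.lower()
    # A missing entry means "no inference is made", not "no constraint exists".
    return SCHEME_CONSTRAINTS.get(scheme)
```

For a scheme such as urn, where constraints are delegated to a namespace specification, this sketch returns None; a fuller treatment would have to follow the delegation rather than guess.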

2.2 Authority and Path Components

In the case of URIs where the hier-part component enables some authority to be identified and that authority (which may be a specification) describes the syntax and semantics of identifiers assigned under that authority, it is possible to use such a description to derive information about a resource. For example, on the basis of the W3C Publication Rules for technical reports [PUBRULES] it is legitimate to conclude the following about the resource identified by the URI http://www.w3.org/TR/2003/WD-xquery-20030502/:

The referenced resource is a Working Draft of a W3C technical report.
The referenced resource was published on 2nd May 2003.
The shortname of the referenced resource is "xquery".
The most recent published version of the technical report is accessible at http://www.w3.org/TR/xquery
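The conclusions listed above could be derived mechanically along the following lines. This is a sketch only: the regular expression is an assumption reverse-engineered from the single example URI in this section, not a transcription of [PUBRULES], and the names TR_URI, STATUS_NAMES and describe_tr_uri are hypothetical.

```python
import re
from datetime import date

# Assumed shape: /TR/<year>/<STATUS>-<shortname>-<YYYYMMDD>/
# Inferred from one example; [PUBRULES] remains the authoritative description.
TR_URI = re.compile(
    r"^http://www\.w3\.org/TR/(?P<year>\d{4})/"
    r"(?P<status>[A-Z]+)-(?P<shortname>[A-Za-z0-9-]+?)-(?P<date>\d{8})/?$"
)

# Only the status code that appears in the example is expanded here.
STATUS_NAMES = {"WD": "Working Draft"}


def describe_tr_uri(uri: str):
    """Return the properties licensed by the assumed policy, or None."""
    match = TR_URI.match(uri)
    if match is None:
        return None  # not covered by the policy: infer nothing
    digits = match.group("date")
    try:
        published = date(int(digits[:4]), int(digits[4:6]), int(digits[6:]))
    except ValueError:
        return None  # eight digits that do not form a calendar date
    if published.year != int(match.group("year")):
        return None  # year segment and date disagree
    status = match.group("status")
    shortname = match.group("shortname")
    return {
        "status": STATUS_NAMES.get(status, status),
        "published": published,
        "shortname": shortname,
        "latest": "http://www.w3.org/TR/" + shortname,
    }
```

A URI that does not match the assumed pattern yields None rather than a partial guess, which keeps the observer within what the published policy licenses. Software written this way remains exposed to changes in the policy itself.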

A common mistake is to make inferences based on the last segment in the path component of a URI. For example, it is common to assume that a URI with a path that ends in ".html" identifies an HTML document, or that a path ending in ".jpeg" identifies an image encoded according to the JPEG specification. These inferences are not licensed by [URI]. When a transfer protocol provides a means to convey media-type information, that information is the authoritative determinant of the media type, and inconsistencies are an error that should not be silently ignored (see [MIMEOverride]).
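The point about authoritative media-type information can be sketched as follows, assuming the response headers are available as a simple mapping of header names to values. The function declared_media_type and the sample headers are hypothetical. The sketch never consults the URI path.

```python
def declared_media_type(headers: dict):
    """Return the media type declared by the transfer protocol, or None.

    The URI is deliberately not a parameter: a path suffix such as
    ".html" is not evidence of the media type.
    """
    for name, value in headers.items():
        if name.lower() == "content-type":
            # Drop parameters such as charset and normalise case.
            return value.split(";", 1)[0].strip().lower() or None
    return None


# Hypothetical response for a URI whose path happens to end in ".html".
response_headers = {"Content-Type": "text/plain; charset=utf-8"}
media_type = declared_media_type(response_headers)
```

Here the declared type is text/plain even though the path suffix might suggest HTML. An observer that notices such an inconsistency should report it rather than silently prefer the suffix.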

2.3 Query Component

HTML, XHTML and XForms all define mechanisms whereby a client can construct the query portion of a URI from information submitted in a web form. The transformation of completed forms into URIs is a powerful means to enable the bookmarking of queries that may then be shared with others or repeated. When the query component of a URI is constructed either by completing a form or by executing a script that originated from the same authority as the constructed reference, the structure of the query remains opaque to the end-user. The user-agent that constructed the URI did so in accordance with a specification it received, as a form or a script, from the authority that resolves the resulting reference. That authority retains all the normal freedoms to organise the way it uses its URI space, to create new query parameters, to change scripts and forms etc.
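The construction of a query component from form data can be approximated with the sketch below, which assumes a form submitted with the GET method and the usual form URL encoding. The function form_submission_uri, the action URI and the field names are hypothetical.

```python
from urllib.parse import urlencode, urlsplit, urlunsplit


def form_submission_uri(action: str, fields: list) -> str:
    """Build the URI a user agent would dereference for a GET form."""
    scheme, netloc, path, _query, _fragment = urlsplit(action)
    # The field names come from the form supplied by the authority;
    # the user agent encodes them without knowing what they mean.
    return urlunsplit((scheme, netloc, path, urlencode(fields), ""))


# Hypothetical form action and fields.
uri = form_submission_uri(
    "http://example.org/search",
    [("q", "uri metadata"), ("lang", "en")],
)
```

The user agent in this sketch follows the form mechanically, so the authority remains free to rename q or lang in a later version of the form without breaking the user agent. URIs already bookmarked are a separate matter.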

However, if the authority also publishes the syntax and semantics of the query parameters that it uses, third parties may independently construct URIs with particular semantic intent. Third parties that find such constructed URIs useful will create content and applications that depend upon their structure. This either limits the freedom of the assignment authority to evolve its site (modulo "Cool URIs don't change" [CoolURI]) without causing breakage elsewhere, or places a maintenance burden on the dependent applications and content.

2.4 Fragment Component

[URI] defines fragment identifiers in the context of retrieval operations where the fragment identifier is resolved in a manner specified by the media-type of the resulting representation. Fragment identifiers are also used in contexts where they merely play a role as part of an identifier in systems where no retrieval is intended. In general, in the absence of a media-type it is not possible to infer properties of a resource or its representations from the fragment identifier component of a URI.
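A minimal sketch of this restriction, assuming Python's urllib.parse.urldefrag, is given below. The function interpret_fragment and the keys of its result are hypothetical. The sketch separates the fragment and refuses to say anything about it unless a media type is supplied by the caller.

```python
from urllib.parse import urldefrag


def interpret_fragment(uri: str, media_type=None):
    """Pair a fragment identifier with the media type that governs it."""
    base, fragment = urldefrag(uri)
    if not fragment:
        return None  # no fragment component present
    if media_type is None:
        # Without a media type the fragment is an opaque string.
        return None
    return {
        "without_fragment": base,
        "fragment": fragment,
        "interpreted_per": media_type,
    }
```

Even with a media type in hand, this sketch only records which specification governs the fragment. The interpretation itself is left to that media type's definition.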

3 Structuring URI Assignments

People and software making URI assignments may only assign URIs for which they are the relevant authority. Authority to make URI assignments is delegated from the URI specification, [URI], to URI scheme specifications. Registered URI schemes are listed in the IANA URI Scheme Registry. URI schemes themselves may further delegate the authority to assign URIs under that scheme to other specifications or to people or software acting in the role of URI assignment authorities.

URI assignment authorities should not reassign URIs as this leads to ambiguity over what a given URI identifies (see [CoolURI]).

3.1 Scheme Component

The assignment of URIs from a particular URI scheme should respect any constraints on the type of resource identified imposed by the relevant URI scheme specification. The choice of scheme will be influenced by:

The available resource access mechanisms: HTTP, FTP, filesystem.
The type of resource identified by the assigned URI.
The authorities the resource owner has to assign URIs.

3.2 Authority and Path Components

A URI assignment authority must operate within the constraints imposed by the delegation path(s) which established its authority. Beyond that the authority is free to organise the URI space under its control in any manner it chooses.

An assignment authority may make use of resource properties in the construction of URIs that identify the resource. Properties used in the construction of resource identifiers should be static with respect to changes in the state of the resource: it is not very useful if the identity of a resource varies with its current state.
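An assignment policy built only from static properties might look like the sketch below. The authority example.org, the path layout and the function assign_report_uri are hypothetical. The layout loosely mirrors the technical report URI discussed in section 2.2 but is not taken from any published policy.

```python
from datetime import date


def assign_report_uri(shortname: str, status: str, published: date) -> str:
    """Assign a URI from properties that are fixed once a report is published."""
    # Shortname, status at publication and publication date never change
    # for a given published report, so the URI never needs to change either.
    # A mutable property (for example a review state) would not be suitable.
    return (
        f"http://example.org/reports/{published.year}/"
        f"{status}-{shortname}-{published:%Y%m%d}/"
    )


# Hypothetical assignment.
report_uri = assign_report_uri("xquery", "WD", date(2003, 5, 2))
```

Because every input is fixed at publication time, calling the function again for the same report always yields the same URI, which is the stability the paragraph above asks for.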

3.3 Query Components

Query components are often used to carry identifying information in the form of keyword/value pairs. Many references that contain query components arise through the use of forms or from the client side execution of scripts. Both forms and scripts generally originate under the control of the assignment authority without the need to publish a specification of the structure and semantics of the associated queries.

However, URIs containing queries are also propagated as bookmarks and in email messages. Assignment authorities should take care to maintain the URI assignments under their authority (see [CoolURI]). Changing the spelling or value space of keywords may result in the failure of references bookmarked by others, or of software that has been programmed based on any published details of the structure and semantics of query strings.

The URI specification [URI] makes no statement about the equivalence of URIs that vary simply in the ordering of query keyword/value pairs. URIs that differ only in the ordering of query keyword/value pairs are different URIs; however, in many cases they may identify the same resource.
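The distinction between URI identity and resource identity can be illustrated with a short sketch. The function with_sorted_query and the example URIs are hypothetical, and sorting the pairs is only appropriate where the assignment authority has stated that ordering is not significant; that statement is an assumption here and not something [URI] provides.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_sorted_query(uri: str) -> str:
    """Rewrite a URI so that its query keyword/value pairs are sorted."""
    parts = urlsplit(uri)
    # Note: sorting also reorders repeated keywords by value, which an
    # authority might treat as significant.
    pairs = sorted(parse_qsl(parts.query, keep_blank_values=True))
    return urlunsplit(parts._replace(query=urlencode(pairs)))


# Two hypothetical URIs that differ only in the order of the pairs.
first = "http://example.org/search?lang=en&q=uri"
second = "http://example.org/search?q=uri&lang=en"

different_uris = first != second
same_after_sorting = with_sorted_query(first) == with_sorted_query(second)
```

An authority that generates its own forms and links could apply such an ordering consistently so that equivalent queries yield identical URIs. Observers should not apply it to URIs assigned by others unless the relevant policy says ordering is insignificant.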

Editorial note
Don't know if this is correct, but couldn't find a mention. Probably ought to say something about whether or not inconsistent ordering of query parameters has any significant effect on Web performance - due to missed cache hit opportunities.

3.4 Fragment Components

Editorial note
Need to add some material here, but I think it risks getting tangled up with httpRange-14 and fragmentInXML-28. Going to pause for now and await inspiration/feedback.

4 Conclusions

It is legitimate for assignment authorities to encode static identifying properties of a resource, e.g. author, version, or creation date, within the URIs they assign. This may contribute to the unique assignment of URIs. It may also contribute to the use of efficient mechanisms for dereferencing resources within origin servers e.g. use of session-ids within URIs.

Assignment authorities may publish specifications detailing the structure and semantics of the URIs they assign. Other users of those URIs may use such specifications to infer information about resources identified by URI assigned by that authority.

URI scheme specifications may impose constraints on the type of resource that identifiers assigned under that scheme may reference. They may also delegate the right to impose such constraints to other specifications. Assignment authorities should honor such constraints in the assignments that they make. Other users of those URIs may make use of any constraints specified by such delegation chains rooted in the URI specification [URI] to infer information about a referenced resource.

People and software using URIs assigned outside of their own authority should make as few inferences as possible about a resource based on its identity and inferences arising from delegated authority. The more dependencies a piece of software has on particular constraints and inferences, the more fragile it becomes to change and the lower its generic utility.

5 References

@@TODO@@ This list needs pruning once the document is finished.

RFC1738: "Uniform Resource Locators (URL)"; IETF; RFC1738; T. Berners-Lee, L. Masinter, M. McCahill; December 1994 (See http://www.ietf.org/rfc/rfc1738.)
RFC2141: "URN Syntax"; IETF; RFC2141; R. Moats; May 1997 (See http://www.ietf.org/rfc/rfc2141.)
RFC2368: "The mailto URL scheme"; IETF; RFC 2368; P. Hoffman, L. Masinter, J. Zawinski; July 1998 (See http://www.iana.org/rfc/rfc2368.)
RFC2392: "Content-ID and Message-ID Uniform Resource Locators"; IETF; RFC 2392; E. Levinson; August 1998 (See http://www.iana.org/rfc/rfc2392.)
RFC2396: "Uniform Resource Identifiers (URI): Generic Syntax"; RFC2396; IETF; T. Berners-Lee, R. Fielding, L. Masinter; August 1998 (See http://www.ietf.org/rfc/rfc2396.)
RFC3406: "Uniform Resource Names (URN) Namespace Definition Mechanisms";RFC3406; IETF; L. Daigle, D.W. van Gulik, R. Iannella, P. Faltstrom; October 2002 (See http://www.ietf.org/rfc/rfc3406.)
CoolURI: "Cool URIs don't change"; W3C; Tim Berners-Lee; 1998 (See http://www.w3.org/Provider/Style/URI.html.)
MIMEOverride: "Client handling of MIME headers"; W3C; Draft TAG Finding ; I.Jacobs; May 2003 (See http://www.w3.org/2001/tag/doc/mime-respect.html.)
XsdComponents: "XML Schema: Component Designators"; W3C; Working Draft; 09 January 2003 (See http://www.w3.org/TR/2003/WD-xmlschema-ref-20030109/.)
WSDL12: "Web Services Description Language (WSDL) Version 1.2"; Appendix C "URI References for WSDL constructs" ; W3C; Working Draft; 3 March 2003 (See http://www.w3.org/TR/2003/WD-wsdl12-20030303/#wsdl-uri-references.)
PUBRULES: "Publication Rules"; W3C; May 2003 (See http://www.w3.org/2003/05/27-pubrules.html.)
URI: "Uniform Resource Identifier (URI): Generic Syntax"; IETF; T. Berners-Lee, R. Fielding, L. Masinter; Currently being revised. (See http://www.apache.org/~fielding/uri/rev-2002/rfc2396bis.html.)

The IETF Internet Draft draft-fielding-uri-rfc2396bis-03 is expected to obsolete RFC 2396, which is the current URI standard. "Architecture of the World Wide Web" uses the concepts and terms defined by draft-fielding-uri-rfc2396bis-03, preferring them to those defined in RFC 2396. The TAG is tracking the evolution of draft-fielding-uri-rfc2396bis-03.