Jim Gettys

Date: Original email message 9 Feb. 1998; last changed 20 Oct. 1998

Universal Resource Identifiers -- Axioms of Web architecture

URI Model Consequences

Introduction

The importance of sharing particular pieces of URI syntax has never been well understood or documented. Most URI design has been based on existing practice and usually shares most of the generic syntax, though there have been exceptions. Both intuition and the reduction in code required to implement new URI schemes have encouraged significant uniformity of design, but recent understanding, and I hope this document, show that it is vital to share as much syntax as possible between URI schemes.

Without an understanding of the consequences of particular choices, however, it has been unclear to designers of new URI schemes whether a particular piece of syntax is appropriate to their application. The consequences of not sharing a particular piece of syntax have not been clear either, so the resulting URI syntax design has sometimes been poor.

Both from a review of discussions on the URI mailing list as part of being asked about the proposed URI syntax and semantics specification, and as part of a meeting I recently attended, I've come to realize that there are two quite subtle consequences of (not) sharing components of URI syntax that have a profound impact on the future evolutionary flexibility of the World Wide Web.

Note: NOTHING I am saying is different for any URx.

Documenting these issues to guide those involved in URI scheme design has become vital to future World Wide Web evolution.

These have generically to do with:

There have been two views of URI syntax:

  1. more or less ``anything goes'' after the colon
  2. (more) ``uniform'' sharing of URI syntax to the extent it may make sense

The fundamental problem has been to distinguish the merits of each approach. The strongest arguments for "anything goes" have been

The strongest arguments for the "uniform" approach that I had previously seen were:

While I have generally preferred the "uniform" approach, I did not have a strong opinion. If this document succeeds in its intent, however, you will decide that the uniform approach is not only desirable, but vital for long term Web architecture.

So what are the consequences of following each path?

View of URI syntax as a Class Hierarchy

One way of framing the discussion is to view URI syntax as though it were an object hierarchy. Then there is a set of methods that can be applied to a URI string:

Note that not all methods necessarily apply to a particular scheme (analogous to an unimplemented method), and some schemes might define additional methods (subclass).
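The object-hierarchy framing can be made concrete with a small sketch. The class and method names below (GenericURI, HierarchicalURI, naming_authority, path_segments) are hypothetical illustrations, not the list of methods from this document; the sketch only assumes Python's standard urllib.parse module for splitting the string.

```python
from urllib.parse import urlsplit


class GenericURI:
    """Hypothetical base class: operations that need only the shared syntax."""

    def __init__(self, text):
        self.text = text
        self.parts = urlsplit(text)

    def scheme(self):
        return self.parts.scheme

    def fragment(self):
        return self.parts.fragment

    def naming_authority(self):
        # The "unimplemented method" case: not every scheme has this component.
        raise NotImplementedError(self.scheme() + " has no naming authority")


class HierarchicalURI(GenericURI):
    """Hypothetical subclass for schemes using the //authority/path form."""

    def naming_authority(self):
        return self.parts.netloc

    def path_segments(self):
        # An additional method that only the subclass defines.
        return [segment for segment in self.parts.path.split("/") if segment]


doc = HierarchicalURI("http://www.w3.org/foo/bar/baz.html#sec1")
mail = GenericURI("mailto:user@example.org")
```

Code that only calls scheme() or fragment() works on both objects without knowing which scheme it was handed; only code that needs the authority or path has to care about the subclass.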

In these terms, the debate can be framed as:

Class hierarchy design is known to be difficult! How do we evaluate the choice?

Consequences of URI's Being Embedded in Content

The utility of embedding links into documents is certainly now clear to the world. But the fact that links are internally embedded into many data types (e.g. HTML, XML, Microsoft Word, Adobe PDF, etc.) has consequences. Note that below I use "naming authority" to mean the scheme-specific delegatee of part of a name space; for example, the www.w3.org in the URL http://www.w3.org/foo/bar/baz.html.

Digital Signatures

Digital signatures on content will increase even further the importance of maintaining bit-for-bit integrity of content. Original signatures may require a private key only available at the time of signing, and may or may not be embedded into content in the same fashion as URI's. Therefore, as signature technology is deployed, gratuitous differences in syntax between schemes will strongly discourage old content from being made available via any new schemes that might be deployed.
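A minimal sketch of why bit-for-bit integrity matters here, under loud assumptions: SHA-256 from Python's hashlib stands in for a real signature, the two HTML fragments are invented, and "x-newscheme" is a hypothetical scheme name. The function rehost is likewise a hypothetical helper that rewrites embedded absolute links when content moves.

```python
import hashlib


def digest(content):
    # SHA-256 is only a stand-in for a signature: any byte change alters it.
    return hashlib.sha256(content).hexdigest()


def rehost(content, old_prefix, new_prefix):
    # Hypothetical migration step: rewrite embedded absolute links.
    return content.replace(old_prefix, new_prefix)


# Invented documents: one embeds an absolute link, one a relative link.
absolute_doc = b'<a href="http://www.example.org/docs/spec.html">spec</a>'
relative_doc = b'<a href="spec.html">spec</a>'

moved_absolute = rehost(absolute_doc, b"http://", b"x-newscheme://")
moved_relative = rehost(relative_doc, b"http://", b"x-newscheme://")

absolute_still_verifies = digest(moved_absolute) == digest(absolute_doc)
relative_still_verifies = digest(moved_relative) == digest(relative_doc)
```

The document with the relative link is byte-identical after the move, so a signature made at the original signing time would still verify; the rewritten document would need the private key again.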

Impact on Opacity of Interfaces

These examples show that unless syntax is shared, new schemes will be very hard to introduce into the Web.

Conclusions

The sections above show that the more sharing of basic URI syntax there is, the more likely it is that (a set of) complex objects can be transported unmodified between different schemes (e.g. FTP to HTTP to HTTP/NG to URN, and to other schemes). Similarly, content can evolve to more useful types without breaking URI references if fragment syntax is shared among related content types (e.g. named anchors in documents). Digital signatures on content will further increase the importance of maintaining bit-for-bit integrity of content.
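The point about transporting content unmodified between schemes can be sketched with relative references. This is only an illustration: the base URLs and link strings are invented, and it relies on urljoin from Python's standard urllib.parse, which resolves relative references for both http and ftp because the two share hierarchical syntax.

```python
from urllib.parse import urljoin

# Invented relative links, as they might appear embedded in a document.
relative_links = ["figures/model.png", "../index.html", "#conclusions"]

# The same document served under two schemes (hypothetical locations).
http_base = "http://www.example.org/docs/uri/model.html"
ftp_base = "ftp://ftp.example.org/docs/uri/model.html"

# The embedded links are never edited; only the base differs.
http_resolved = [urljoin(http_base, link) for link in relative_links]
ftp_resolved = [urljoin(ftp_base, link) for link in relative_links]

for http_url, ftp_url in zip(http_resolved, ftp_resolved):
    print(http_url, "|", ftp_url)
```

Because path and fragment syntax are shared, the same resolution code and the same document bytes serve both schemes; a scheme with its own private path syntax would need its own resolver and rewritten links.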

Some naming systems lack the semantic meanings covered by the commonly used URI syntax, and sometimes those naming systems provide additional semantic meaning of their own. For those naming systems in which parts of the URI syntax do not apply, it is clearly acceptable in my view to ignore that part of the syntax. I hope this document convinces you, however, that where the semantic meaning of name components is identical, mapping them into a common URI syntax has major medium and long term benefits to the World Wide Web. For those who are working on facilities which add new semantic meanings that might be shared between schemes, I hope this document convinces you it is worth working on defining what that common syntax should be.

If the same content cannot be served up under alternate schemes, or moved to future schemes used in the Web, it will greatly inhibit introduction of new schemes into the Web. If Web software cannot be written without intimate intertwining of knowledge between components, and therefore must be updated whenever new schemes or content types are introduced, it will greatly inhibit introduction of new schemes and software into the Web.

If URI syntax, therefore, is gratuitously different for the same semantic meaning, it will strongly discourage future innovation in the World Wide Web. The more random URI syntax is between schemes, the more Web evolution will be inhibited, and the more programmers and protocol designers we'll keep employed kludging around... (job security!). But since there is enough work to go around in the Web, I believe it is clear that unity of URI syntax for semantically equivalent constructions is essential for the future health of the World Wide Web.

A single URI specification that covers general URI syntax, along with guidance on how to design new URI schemes (and the consequences of different design decisions), probably as a separate new document, is preferable to splitting the URI spec into several specifications (e.g. scheme, vs. independent URL and URN specs). Each URI scheme should be able to reference this single syntax and semantics specification, and in doing so make clear which components of the generic URI syntax apply to that scheme (and which components do not!). The November draft submitted by Fielding is closest to this model, but does need some further work; e.g. the host part of the document needs clear delineation from the rest of the URI spec, so that it is clear that this is additional syntax which is common to a number of schemes, but not at all inherent in URI syntax.

Jim Gettys
Digital Equipment Corporation
Visiting Scientist, World Wide Web Consortium