URI Model Consequences

Introduction

The importance of sharing particular pieces of URI syntax has never been well understood or documented. Most URI design has been based on existing practice and usually shares most of the generic syntax, though there have been exceptions. Both intuition and the reduction in code required to implement new URI schemes have encouraged significant uniformity of design, but recent understanding, and I hope this document, show that it is vital to share as much syntax as possible between URI schemes.

Without an understanding of the consequences of particular choices, however, designers of new URI schemes have been unsure whether a particular piece of syntax is appropriate to their application, and the consequences of not sharing a particular piece of syntax have not been clear. The resulting URI syntax design has sometimes been poor.

Both from reviewing discussions on the URI mailing list after being asked about the proposed URI syntax and semantics specification, and from a meeting I recently attended, I have come to realize that there are two quite subtle consequences of sharing (or not sharing) components of URI syntax, and that they have a profound impact on the future evolutionary flexibility of the World Wide Web.

Note: NOTHING I am saying is different for any URx.

Documenting these issues to guide those involved in URI scheme design has become vital to future World Wide Web evolution.

These issues broadly concern:

Constraints imposed by content on the Web
Constraints imposed by the need for information hiding, to enable software in the Web to remain modular and extensible.

There have been two views of URI syntax:

more or less ``anything goes'' after the colon
(more) ``uniform'' sharing of URI syntax to the extent it may make sense

The fundamental problem has been to distinguish the merits of each approach. The strongest arguments for "anything goes" have been

the constraints of the syntax can make things difficult for scheme designers
existing identifier syntaxes, which are already in widespread use and familiar to those who use them, can be adopted without further thought

The strongest arguments I had previously seen for the "uniform" approach were:

general simplicity
fewer parsers to build
and general design intuition that uniformity is better than chaos

While I have generally preferred the "uniform" approach, I did not have a strong opinion. If this document succeeds in its intent, however, you will decide that the uniform approach is not only desirable, but vital for long term Web architecture.

So what are the consequences of following each path?

View of URI syntax as a Class Hierarchy

One way of framing the discussion is to view URI syntax as though it were an object hierarchy. Then there is a set of methods that can be applied to a URI string:

Scheme(URI),
Fragment(URI),
Relpath(URI), etc.

Note that not every method necessarily applies to a particular scheme (analogous to an unimplemented method), and some schemes might define additional methods (as a subclass would).

In these terms, the debate can be framed as:

whether different URx's inherit from Object (``anything goes''),
or if they inherit from a basic ``uniform'' URI syntax.
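The framing above can be sketched in code. This is only an illustration under stated assumptions: the class names UniformURI, HierarchicalURI and OpaqueName are hypothetical, and the string handling is deliberately naive rather than a faithful URI parser.

```python
class OpaqueName:
    """Hypothetical "anything goes" name: inherits only from object."""

    def __init__(self, text: str) -> None:
        self.text = text


class UniformURI:
    """Hypothetical base class holding the syntax every scheme shares."""

    def __init__(self, text: str) -> None:
        self.text = text

    def scheme(self) -> str:
        # Everything before the first colon.
        return self.text.partition(":")[0]

    def fragment(self) -> str:
        # Everything after the first '#', or '' when there is none.
        return self.text.partition("#")[2]


class HierarchicalURI(UniformURI):
    """Hypothetical subclass that adds a method for schemes using '//'."""

    def authority(self) -> str:
        # Text between '//' and the next '/', '?' or '#'.
        rest = self.text.partition("//")[2]
        for index, char in enumerate(rest):
            if char in "/?#":
                return rest[:index]
        return rest


uri = HierarchicalURI("http://www.w3.org/foo/bar/baz.html#top")
print(uri.scheme(), uri.authority(), uri.fragment())

# Generic code can rely on fragment() for any UniformURI, but not for a
# name that inherits only from object.
print(hasattr(OpaqueName("x-opaque:whatever"), "fragment"))
```

Code written against the base class keeps working for any scheme that inherits from it, which is the property the "uniform" view is after.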

Class hierarchy design is known to be difficult! How do we evaluate the choice?

Consequences of URI's Being Embedded in Content

The utility of embedding links into documents is certainly now clear to the world. But the fact that links are internally embedded into many data types (e.g. HTML, XML, Microsoft Word, Adobe PDF, etc.) has consequences. Note that below I use "naming authority" to mean the scheme-specific delegatee of part of a name space; for example, www.w3.org in the URL http://www.w3.org/foo/bar/baz.html.

If fragment syntax (to the extent of understanding that the URI is a fragment) isn't shared between two schemes (e.g. <A href="#foo">), you can't move individual, completely self-referential documents between schemes without rewriting the document. In the Web, the fragment syntax is a property of the media type and is evaluated by the client.
If fragment syntax is not shared between different media types of the same capability (e.g. HTML, XML, Word, or image types such as GIF, JPEG, PNG), then you can't have a URI reference that can evolve to superior media types as they become available, or that is even likely to work properly today with content negotiation.
If relative syntax (to the extent of understanding that the URI is relative, and what part of the URI string is relative) isn't shared between two schemes (e.g. <A href="foo">), you can't move sets of documents that are internally self-referential between schemes without rewriting.
If ".." syntax as a path component in relative URIs isn't shared between schemes, you can't easily maintain collections of document sets and refer to them between schemes without rewriting.
If "/" syntax (to the extent of understanding that the URI refers to a path relative to the current naming authority) isn't shared, you can't easily move multiple sets of documents up or down in a relative hierarchy of names while they share a common set of documents, whether in that scheme or between schemes, without rewriting the content. The best example is a site with a common set of GIF, JPEG and PNG images that you want to reorganize, changing the depth of a subtree or moving it to a directory at a different depth.
If naming authority syntax (e.g. what comes after "//" in most URL schemes) and relative path syntax (to the extent of understanding that the URI has a naming authority, and what part of the URI string is the naming authority vs. the path) aren't shared between two schemes, you can't share identical name spaces and serve them up via different schemes. (The naming authority syntax is a property of the scheme.) The fact that HTTP and FTP have the same syntax, for example, has often been exploited by sites transitioning from FTP archive service to HTTP archive service, so that the URLs can be identical except for the scheme; the same content can be served via two schemes simultaneously.
If query syntax (to the extent of understanding that the URI has a query, and what part of the URI string is the query) isn't shared between two schemes (the syntax is a property of the server, rather than the client).
If the encoding of non-URI characters into URIs (to the extent that URIs are derived from such characters) isn't unified and shared between schemes and servers, you can't move such documents between schemes and servers. [this item added by Martin Dürst; see background]
There are a few other pieces of URI path syntax for which this document does not explore the consequences, but I think you can work it out for yourself, given these examples.
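Several of the points above can be illustrated with a short sketch. It assumes Python's standard urllib.parse module, whose urljoin applies the same relative-reference rules to both http and ftp; the host www.example.org, the paths, and the helper names are made up for illustration. The same embedded references, resolved against the same document served under two schemes, yield names that differ only in the scheme, so the content needs no rewriting.

```python
from urllib.parse import urljoin, urlsplit

# Hypothetical references as they might appear inside one HTML document:
# a fragment, a sibling document, a ".." path, and a "/" rooted path.
EMBEDDED_REFS = ["#foo", "foo.html", "../images/logo.gif", "/images/logo.gif"]


def resolve_all(base: str) -> list:
    """Resolve every embedded reference against the document's own URI."""
    return [urljoin(base, ref) for ref in EMBEDDED_REFS]


def without_scheme(uri: str) -> str:
    """Drop 'scheme:' so results from two schemes can be compared."""
    return uri[len(urlsplit(uri).scheme) + 1:]


http_links = resolve_all("http://www.example.org/docs/guide/index.html")
ftp_links = resolve_all("ftp://www.example.org/docs/guide/index.html")

for link in http_links + ftp_links:
    print(link)

# True when both schemes expose an identical name space.
same_name_space = (
    [without_scheme(u) for u in http_links]
    == [without_scheme(u) for u in ftp_links]
)
print(same_name_space)
```

If either scheme interpreted "#", "..", or a leading "/" differently, the two lists would diverge and the document itself would have to be edited before it could be served under the other scheme.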

Digital Signatures

Digital signatures on content will further increase the importance of maintaining bit-for-bit integrity of content. Original signatures may require a private key available only at the time of signing, and may or may not be embedded into content in the same fashion as URIs. Therefore, as signature technology is deployed, gratuitous differences in syntax between schemes will strongly discourage making old content available via any new schemes that might be deployed.
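A minimal sketch of why this matters, with assumptions stated plainly: SHA-256 from Python's hashlib stands in for the digest a real signature scheme would cover, and the "x-new:" link syntax is invented purely to represent a scheme whose relative syntax differs. Rewriting a single embedded link changes the bytes, so any signature over the original content would no longer verify.

```python
import hashlib

# Content as originally signed, with a relative link in the shared syntax.
original = b'<A href="../images/logo.gif">logo</A>'

# Hypothetical rewrite forced by a scheme with a different relative syntax.
rewritten = b'<A href="x-new:images:logo.gif">logo</A>'


def digest(content: bytes) -> str:
    # SHA-256 is an assumption here; it stands in for whatever hash a
    # real signature would be computed over.
    return hashlib.sha256(content).hexdigest()


# Content moved unmodified between schemes keeps its digest.
unchanged_ok = digest(original) == digest(bytes(original))

# Content that had to be rewritten does not.
rewrite_ok = digest(original) == digest(rewritten)

print(unchanged_ok, rewrite_ok)
```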

Impact on Opacity of Interfaces

If fragment syntax is not solely media type dependent (e.g. if it also depends on the scheme), then introducing a new scheme would (potentially) require that each media viewer be updated for that scheme. This is likely to be a prohibitive amount of work.
Similarly, to be able to introduce new schemes into the Web without having to modify all URI access code in applications, the URI parsing code in applications must be able to remove the fragment from the base URI, or it will have to be updated for each scheme.
Relative URI parsing and the following of links also cannot be independent of scheme unless relative URI syntax is shared; otherwise user agents and other programs that follow relative links would have to be updated whenever a new scheme is introduced.

These examples show that unless syntax is shared, new schemes will be very hard to introduce into the Web.
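As a rough sketch of the information hiding described above: when every scheme agrees that "#" introduces the fragment, an application can separate the fragment without knowing anything about the scheme. The function name split_fragment and the "x-future" scheme are hypothetical, chosen only for illustration.

```python
def split_fragment(uri_reference: str) -> tuple:
    """Split at the first '#' without inspecting the scheme at all."""
    base, _, fragment = uri_reference.partition("#")
    return base, fragment


references = [
    "http://www.example.org/a/b.html#sec2",
    # A scheme this code has never heard of is handled identically.
    "x-future://www.example.org/a/b.html#sec2",
    # A reference with no fragment yields an empty fragment.
    "ftp://www.example.org/pub/file.txt",
]

for reference in references:
    print(split_fragment(reference))
```

The access code receives only the base part and the media viewer receives only the fragment, so neither needs updating when a new scheme appears, provided that scheme shares this piece of syntax.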

Conclusions

The sections above show that the more basic URI syntax is shared, the more likely it is that (a set of) complex objects can be transported unmodified between different schemes (e.g. FTP to HTTP to HTTP/NG to URN, and to other schemes). Similarly, content can evolve to more useful types without breaking URI references if fragment syntax is shared among related content types (e.g. named anchors in documents). Digital signatures on content will further increase the importance of maintaining bit-for-bit integrity of content.

Some naming systems lack the semantic meanings covered by the commonly used URI syntax, and some naming systems provide additional semantic meanings of their own. For those naming systems in which parts of the URI syntax do not apply, it is clearly acceptable in my view to ignore that part of the syntax. I hope this document convinces you, however, that where the semantic meanings of name components are identical, mapping them into a common URI syntax has major medium- and long-term benefits to the World Wide Web. For those who are working on facilities that add new semantic meanings that might be shared between schemes, I hope this document convinces you it is worth working on defining what that common syntax should be.

If the same content cannot be served up under alternate schemes, or moved to future schemes used in the Web, it will greatly inhibit introduction of new schemes into the Web. If Web software cannot be written without intimately intertwining knowledge between components, and must therefore be updated whenever new schemes or content types are introduced, it will greatly inhibit introduction of new schemes and software into the Web.

If URI syntax, therefore, is gratuitously different for the same semantic meaning, it will strongly discourage future innovation in the World Wide Web. The more random URI syntax is between schemes, the more Web evolution will be inhibited, and the more programmers and protocol designers we'll keep employed kludging around... (job security!). But since there is enough work to go around in the Web, I believe it is clear that unity of URI syntax for semantically equivalent constructions is essential for the future health of the World Wide Web.

A single URI specification that covers general URI syntax, along with guidance on how to design new URI schemes (and the consequences of different design decisions), probably as a separate new document, is preferable to splitting the URI spec into several specifications (e.g. scheme, vs. independent URL and URN specs). Each URI scheme should be able to reference this single syntax and semantics specification, and in doing so make clear which components of the generic URI syntax apply to that scheme (and which components do not!). The November draft submitted by Fielding is closest to this model, but does need some further work; e.g. the host part of the document needs clear delineation from the rest of the URI spec, so that it is clear that this is additional syntax which is common in a number of schemes, but not at all inherent in URI syntax.

Jim Gettys
Digital Equipment Corporation
Visiting Scientist, World Wide Web Consortium