When HTTP Goes Bad - DRAFT

Jonathan Rees
4 June 2010

This memo considers three ideas applying to the Web, not necessarily as serious proposals (although given encouragement they could be turned into such) but as thought experiments or fantasies meant to sharpen the discussion of the "meaning" of URIs and other current issues of web architecture. The first fantasy is the idea that a URI's meaning is in how it is used, not in what it "identifies" or "names". The second is the prospect of second sourcing for URI behavior. The third is the idea of encyclopedia-style documentation for URIs.

Use and expectation

The first fantasy is that the meaning of a URI might be defined by how the URI is used, and how we expect it to be used, not what it "identifies".

We use URIs because we have certain expectations about what happens after we use them. Suppose we have some expectation of what happens when we use a URI on the Web, and the Web doesn't meet that expectation. For example, one might expect 'GET http://w3.org/' to yield the W3C home page. One may even have invested in its doing so, e.g. by going to the trouble of putting links in email messages or HTML files. Suppose it doesn't - suppose it yields a 404, or a phishing site? What do we do?

Some preventive steps:

  1. Don't do it - link only using URIs that you control.
  2. Arrange for the consequences of failure to be inversely related to the risk, i.e. don't use a risky URI in any consequential way.
  3. Extract a promise from the URI owner to meet the expectation (here I define "URI owner" to be the party who has the ability to control what GET U and other HTTP methods do, in the usual DNS/HTTP way).
  4. Obtain a service contract / threaten with penalties.

Insurance and penalties can make remediation affordable, but are not a substitute for having a system that works as expected.

Detecting failure is clearly important, but I won't dwell on it. In many cases the problem will be detected quickly, while in others it may be worthwhile to obtain some kind of verifiability, such as redundancy (e.g. checksum) or digital signatures.
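
As a minimal sketch of this kind of verifiability, assuming a hypothetical URI and assuming a digest is recorded at link-creation time, detection could amount to re-fetching and comparing a checksum:

  import hashlib
  import urllib.request

  def digest_of(uri):
      # Fetch the current representation and return a checksum of its bytes.
      with urllib.request.urlopen(uri) as response:
          return hashlib.sha256(response.read()).hexdigest()

  # At link-creation time, record what was received (hypothetical URI).
  recorded = digest_of("http://example.com/a")

  # Later, check whether the expectation still holds.
  if digest_of("http://example.com/a") != recorded:
      print("expectation upset: time for remediation")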

Remediation means debugging and fixing the problem so that expectations are met, or else giving up on URI behavior and fixing whatever it is that leads up to use of the URI (e.g. fixing the links at the source).

The apparatus of the web has many components controlled to varying degrees by many entities, potentially permitting a repair intervention at many different points. If a component is judged responsible for the failure and can be fixed either directly or via social or economic pressure, great, but otherwise a workaround is necessary. Intervention points include:

  1. server configuration or software
  2. content
  3. global DNS configuration
  4. local DNS (to reroute domain name in URI)
  5. local router (to reroute at IP level)
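
As a rough client-side sketch of intervention point 4, rerouting a domain name locally so that requests for a disappointing host reach a mirror instead: the hostname, the mirror's address, and the per-process patch below are all hypothetical stand-ins for what would more realistically be a hosts-file or local-resolver change.

  import socket

  # Hypothetical: serve requests for example.com from a mirror at 192.0.2.10
  # (a documentation address standing in for the mirror's real address).
  REROUTE = {"example.com": "192.0.2.10"}

  _real_getaddrinfo = socket.getaddrinfo

  def rerouting_getaddrinfo(host, *args, **kwargs):
      # Resolve rerouted hosts to the mirror's address; leave all others alone.
      return _real_getaddrinfo(REROUTE.get(host, host), *args, **kwargs)

  socket.getaddrinfo = rerouting_getaddrinfo

  # A GET of http://example.com/a (e.g. via urllib) now reaches the mirror,
  # while the URI itself, and the links that mention it, stay unchanged.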

The choice of solution will probably be dominated by social and economic considerations: people will do what's easiest, subject to their own aims (such as preserving their credibility or reputation).

Let's apply this analysis to some particular situations.

Simple disappointment

  1. Alice GETs http://example.com/a, gets D1.
  2. Takes on an expectation that future GETs will yield D1 or something similar.
  3. Creates link.
  4. Subsequent GET yields dissimilar D2 or 404, upsetting expectation.
  5. Contacts site administrator.
  6. Cases:
    1. Site administrator admits problem and fixes it.
    2. Alice modifies local configuration (browser, DNS, etc) to reroute the URI so as to meet expectations.
    3. Alice gives up, fixes link(s) or performs damage control if possible.

"Modifies local configuration," my second fantasy, needs some explanation as this is not currently standard practice on the Web. This means doing whatever is in Alice's power to force the GET to meet expectations. This could be done in various ways:

In any case the chosen solution would have to be supported by "second sourcing" - infrastructure that provides behavior that meets expectations. An enterprise might be in a position to arrange this for URIs it cares about, but more likely the infrastructure would come from a community resource or paid service (assuming expectations were shared among its users). The switch from failing to working behavior could be provided either as a central or replicated service, or on the client side, similar to a spam-block blacklist.
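
A client-side sketch of such a service, assuming a hypothetical community-maintained table that maps disappointing URIs to second sources, consulted the way a mail client consults a spam blacklist:

  import urllib.request

  # Hypothetical community-maintained table: failing URI -> working second source.
  SECOND_SOURCES = {
      "http://example.com/a": "http://mirror.example.org/example.com/a",
  }

  def get(uri):
      # Try the Web's own behavior first; fall back to the second source on failure.
      try:
          return urllib.request.urlopen(uri).read()
      except Exception:
          fallback = SECOND_SOURCES.get(uri)
          if fallback is None:
              raise
          return urllib.request.urlopen(fallback).read()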

If there is sufficient demand for remediation of this sort, products and standards will emerge to meet this need. My hunch is that as the web grows, so will the pressure to provide this kind of thing.

Versioning

A common problem is to expect GET of a URI to yield a particular version of a document, when in fact the server serves revisions or replacements of the document as they come out. The URI is really associated with expectations, not necessarily with any particular version or invariant. In order to use a URI as a way to get a particular version, one should extract a credible promise of stability from the server. Given that, the particular bits one GETs once can become part of what one can expect later.
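
One hedged sketch of detecting that such an expectation has been upset, using only standard HTTP validators (the URI is hypothetical, and the server is assumed to emit an ETag):

  import urllib.error
  import urllib.request

  uri = "http://example.com/spec"                       # hypothetical
  first = urllib.request.urlopen(uri)
  etag = first.headers.get("ETag")                      # validator for the version we read

  # Later: ask whether the representation we relied on is still current.
  recheck = urllib.request.Request(uri, headers={"If-None-Match": etag})
  try:
      urllib.request.urlopen(recheck)
      print("the server now gives a different version") # 200 with new content
  except urllib.error.HTTPError as err:
      if err.code == 304:
          print("still the version we expected")        # 304 Not Modified
      else:
          raise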

Of course this problem is widely recognized. For example, it is the motivation behind W3C's versioning practices, and behind the WebCite service used by academic journals.

This is not to say that it is bad to use a URI in the absence of a promise of stability. For example http://news.google.com/ is useful precisely because its GET responses do vary - and we expect them to.

An expectation of stability can be a gamble or a user error. The hope for stability may be due to the lack of a stable source (another URI) for the same information. However, a server may set an expectation of stability and then violate it. In this case the responsibility is on the server.

Persistence

  1. Site advertises U as a "permalink".
  2. Alice does GET U, which fails (host down, 404, or wrong result).
  3. Contacts site administrator.
  4. Out of business or unresponsive.
  5. Alice finds desired document in another way (different location and/or protocol).
  6. Alice modifies local configuration (browser, DNS, etc) to reroute, as above.

Observe that any theory of URI persistence has to allow for remediation, as human institutions are inherently fallible. Second sourcing - e.g. putting copies of a document in multiple libraries - is the only proven approach to persistence.

Stability of content is the easiest form of persistence to understand and specify, and it has a long history and deep infrastructure in the world of libraries and archives. But it is just a special case of meeting expectations. A resource that yields the number of days since January 1, 2000 is not stable in its GET responses but has stable, clear expectations and is therefore amenable to second sourcing.
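
A second source for the resource just described needs nothing from the original server, only the shared expectation; a minimal sketch of what such a second source would compute:

  import datetime

  def days_since_2000():
      # The expectation is clear enough to implement independently.
      return (datetime.date.today() - datetime.date(2000, 1, 1)).days

  # A second source for this resource would return str(days_since_2000())
  # as the body of its GET response.
  print(days_since_2000())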

Ontology revision

This is a special case of versioning, of particular concern to the use of RDF and OWL in scholarly publishing.

  1. Alice reads OWL file O1 via GET U providing axioms involving URIs of the form U#frag.
  2. Alice publishes OWL document D that uses these URIs, such that the correctness of D is contingent on the axioms in O1.
  3. URI owner updates ontology, so GET U yields a different OWL document O2.
  4. Alice is made to appear incompetent as D is now understood relative to O2 instead of O1.
  5. Alice attempts to get the URI owner to revert GET U to O1, and fails.
  6. Alice backpedals:
    1. updates D to somehow refer to O1 instead of using U as a reference, or
    2. if D has been deposited archivally, publishes an apology and hopes that her friends read the apology, or
    3. the archive itself apologizes for receiving an item containing a non-persistent reference, and deposits an amendment.

OWL 2 has support for substituting a URI for which one has an expectation of stable behavior for a URI for which one does not. As long as the expectation is met, the substituted reference should preserve Alice's (and the publisher's) reputation.
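
A hedged sketch of what such a substitution might look like, assuming the ontology owner publishes a stable version IRI for O1 alongside the generic U, and that Alice's document imports the version IRI rather than U. All IRIs here are hypothetical, D can keep using its U#frag term IRIs unchanged, and rdflib is used only to show that the result is ordinary OWL/RDF:

  from rdflib import Graph

  # Hypothetical IRIs; <http://example.com/ontology/1.0> plays the role of
  # O1's stable version IRI.
  alice_doc = """
  @prefix owl: <http://www.w3.org/2002/07/owl#> .

  <http://example.com/alice/D> a owl:Ontology ;
      # Import the stable version IRI of O1 rather than the generic U,
      # whose GET behavior may change when the ontology is revised.
      owl:imports <http://example.com/ontology/1.0> .
  """

  Graph().parse(data=alice_doc, format="turtle")   # parses; D is now pinned to O1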

Metadata architecture

The story is similar for metadata (author, title, publication date of the document yielded by GET U). Metadata expressed in terms of U (e.g. in RDF: <http://example.com/whgb> dc:title "When HTTP Goes Bad") sets an expectation regarding GET U, namely that it yields a document to which that metadata applies (one with the given title, and so on).

Metadata might be published by the URI owner, or by someone else. If by the URI owner, the expectation is a kind of promise, and meeting it is the owner's responsibility (not necessarily fulfilled). If by someone else, the URI owner has not made promises and has plausible deniability of wrongdoing. Client-side remediation is therefore needed.
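
As a sketch of what client-side checking might involve, assuming the hypothetical metadata above is available as RDF (here parsed with rdflib) and that the document's HTML title is a fair proxy for dc:title:

  import re
  import urllib.request
  from rdflib import Graph, URIRef
  from rdflib.namespace import DC

  # Hypothetical third-party metadata about U.
  meta = Graph()
  meta.parse(data="""
      @prefix dc: <http://purl.org/dc/elements/1.1/> .
      <http://example.com/whgb> dc:title "When HTTP Goes Bad" .
  """, format="turtle")

  u = URIRef("http://example.com/whgb")
  expected_title = str(meta.value(u, DC.title))

  # Does GET U still yield a document to which the metadata applies?
  html = urllib.request.urlopen(str(u)).read().decode("utf-8", "replace")
  found = re.search(r"<title>(.*?)</title>", html, re.I | re.S)
  if not found or expected_title not in found.group(1):
      print("metadata expectation upset; remediation (or a second source) needed")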

Other protocols and languages

There is no reason to think a priori that the way a URI is used under HTTP imposes any restriction on how it is used under another protocol or language, or under extensions added to HTTP such as new request methods. Even the web architecture theory of "resource identity" at best only says that operational behavior should be somehow related to some hypothetical "resource". For this to be meaningful at all, the choice of resource must somehow be known (and it's not clear how such a choice would be made operational). The link would have to be drawn either via agreement (perhaps the spec for the second protocol articulates the relationship to HTTP) or via common convention.

This problem of coupling to HTTP applies as much to the use of the URI in RDF/OWL/SPARQL as to any other language or protocol. (In RDF "use" would mean what kind of thing happens when you do a query or apply an inference engine.) The link to HTTP is arbitrary and needs to be made explicit if expectations are to be set.

In any case, a second protocol simply establishes a new set of expectations of Web (or Internet) behavior related to use of the URI, and the same analysis (of what happens when expectation is upset) applies.

httpRange-14

The httpRange-14 rule provides a threadbare link between HTTP and RDF/OWL/SPARQL.
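
Concretely, the rule ties what U may refer to in RDF/OWL to the status code of an HTTP exchange: a 2xx response means U identifies an "information resource", a 303 redirect is noncommittal about the kind of resource, and other responses license no conclusion. A rough sketch (hypothetical helper; plain http only, no error handling, query strings ignored):

  import http.client
  from urllib.parse import urlparse

  def http_range_14(uri):
      # Issue a GET without following redirects, then apply the rule.
      parts = urlparse(uri)
      conn = http.client.HTTPConnection(parts.netloc)
      conn.request("GET", parts.path or "/")
      status = conn.getresponse().status
      if 200 <= status < 300:
          return "U identifies an information resource"
      if status == 303:
          return "U may identify any resource; consult the redirect target"
      return "no conclusion from this response"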

  1. Alice does GET U, which yields a 200 response.
  2. Alice infers (based on the httpRange-14 rule) that U refers to an "information resource", investing accordingly, perhaps by composing some RDF. (I'm not sure what interesting consequences would follow only from U referring to an information resource, but let's assume.)
  3. The URI owner publishes an assertion that the URI refers to a person (U rdf:type foaf:Person).
  4. Alice contacts the URI owner, points them to the httpRange-14 rule; owner acknowledges desire to conform, retracts person assertion.
  5. Or: Alice contacts the URI owner, points them to the httpRange-14 rule; owner acknowledges desire to conform, promises never to yield another 200 for that URI.

Protocol correctness vs. application correctness

This analysis underscores the difference between two notions of "correctness" of URI use: correct use according to the HTTP protocol (RFC 2616), and correct use according to application needs. ("Use" encompasses what any agent might do with the URI, including both clients and servers.)

It is worth considering the idea that these two kinds of correctness ought to coincide, that is, that the origin server is always right (regarding HTTP responses), and any application that assumes otherwise is badly designed. This is a conservative, consistent, and parsimonious position, and it is the premise behind advocacy for non-http: naming schemes such as urn:. The claim is that uses of URNs can meet certain application expectations that cannot be made consistent with the expectations for the use of http: URIs that were set by the protocol spec.

Another way of saying this is that application expectations constitute "squatting" - taking the right to set expectations away from the HTTP specification and the domain registry. Even the URI owner is a "squatter" if he/she promises HTTP behavior that differs from what is delivered, if there's a chance the promise may not be met. In this theory meaning is set by the Web, and by the owner only indirectly through his/her control of HTTP behavior.

To make this work would require all http: URIs occurring in applications and documents to be "field upgradeable," since all http: URIs are linked to services and all services are unreliable. An incorrect link in an HTML document would be considered a bug in the document, and the document would have to be modified in order to fix the problem. One would therefore never (correctly) put an http: URI in an archival (write-once) store, since by making a document unwriteable one loses the opportunity to fix its http: URIs.

Because at present there is no general practice of second sourcing or consumer-side modulation of URI behavior, the single-correctness assumption is the status quo. Notably, the assumption is incompatible with the use of unbound http: URIs in documents (RDF documents especially) when the intent is to deposit the document in a persistent store.

(Why the qualifier "unbound": if sufficient information is present in or alongside the RDF to match each http: URI to its expectations (e.g. to a document or to a document describing use of the URI), then the URI is 'bound' and the user of the archive has no need to consult a reference work such as the Web to research its meaning.)

Documenting URI expectations

URI second sourcing treats URIs as public property. If expectations are shared by a community, that community can share a second source of URI behavior. However, there is no reason to think that URIs are different from any other natural or engineered language in their ability to gain stable consensus expectations. Expectations can vary through time (the server might do different things at different times - e.g. serve different versions of a specification), from one context / protocol / language to another, or from one artifact to another (as when, in XML or RDF, a URI is used one way in one database, and a different way in another database).

While it is possible that an occurrence of a URI will be accompanied by sufficient information (such as a scholarly citation) to determine what is expected of it, often this will not be the case. If expectations are variable (the meaning has changed through time or across contexts), an agent would be delighted to have a user's guide to that URI available. It is not a prescriptive specification that is needed, as any particular specification may be versioned, ignored, or reinterpreted. Instead the most useful form of documentation is the kind that we enjoy for natural language: a dictionary or encyclopedia. The salient feature of dictionaries and encyclopedias is their neutrality - they explain as much as is known (to the authors of the reference work) of the way the word or phrase has been used over time. They are objective first, and prescriptive only by way of informing.
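
To make the encyclopedia idea slightly more concrete, a hypothetical machine-readable entry might record observed uses rather than a single prescription; a sketch, with entirely invented fields:

  from dataclasses import dataclass, field

  @dataclass
  class ObservedUse:
      period: str       # e.g. "2005-2008"
      context: str      # protocol, language, or community in which the use was seen
      description: str  # what agents did with the URI, and what they seemed to expect

  @dataclass
  class UriEntry:
      uri: str
      uses: list = field(default_factory=list)  # all documented uses, not just the owner's current story

  entry = UriEntry("http://example.com/a")
  entry.uses.append(ObservedUse("2005-2008", "HTML links",
                                "GET yielded successive drafts of the same report"))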

The "follow your nose" process (and its possible extension through Link:) may lead to documentation, and perhaps even promises regarding future behavior. Current practice makes this documentation, when it exists, quite thin, reflecting only what the URI owner wants others to hear at the present time. If the URI has changed hands or recommendations have been revised then the documentation may not adequately serve those who consult it, as the use to be documented may not be described by the documentation.

Documentation that is specific enough will resemble API documentation or a contract proposal, and might enable second sourcing of server behavior, just as adequately detailed API documentation enables correct (re)implementation of an API, or a contract proposal enables one to put the contract out for bids. Given clearly articulated expectations, the DNS/ICANN Web might become just one contender for the job of URI service, with competition over price and quality a distinct possibility. It would be useful to predict what circumstances might bring about such competition, so that we can anticipate developments and issue technical recommendations to prevent, channel, or facilitate such a future, as we see fit.

Acknowledgments

Dan Connolly for "field upgradeable" and many other ideas, Alan for many discussions of second sourcing, URI documentation, and the shared names idea, Tim Danford for a conversation about the URI encyclopedia, Larry Masinter for issuing the challenge, David Booth for forcing the rejection of "resource" and "denotation", Herbert van de Sompel for discussing Memento, and others too numerous to mention.