Toward an ontology for HTTP/1.1 semantics

Noodling diagram: http://sw.neurocommons.org/tmp/jar-diagram-6.pdf (temporary location)

(Page started by JAR during the AWWSW telecon of 11 November 2008)

AWWSW has a number of projects in front of it. One goal is to write an ontology that models core features of HTTP/1.1 as specified in RFC 2616.

A better starting point than the RFC might be HTTPbis, with the caution that this is still only in draft form.

By "model" I mean deal with the semantics of the interaction - what the requester and responder can be "held to" assuming they are correctly following the protocol. This differs from HTTP-in-RDF, which so far accounts only for the syntax of the interaction.

By this I mean only HTTP/1.1, not HTTP/1.1 as interpreted by AWWW, TAG findings, or any other documents. There is no way to avoid some amount of interpretation, since the spec is so sketchy in so many places; but such interpretation should be guided by (a) actual HTTP practice in the wild (i.e. the way other people have interpreted it), and (b) supporting documentation, such as email and papers by Fielding et al, that are consistent with (a).

I haven't started on any of the content negotiation details, but that will have to be done at some point.

One question is whether this is worthwhile. We decided at one of our telecons that it was, both as a way to extend the work in HTTP-in-RDF beyond the purely syntactic realm, and to tether the more general semantic web work that we're engaged in. There is also a desire to capture much of the mechanism of HTTP in RDF so that we can talk about, say, the classes of correct and incorrect HTTP exchanges - if there are any... but this is harder and it's not clear what the payoff would be...

Entity (class): this is a syntactic notion and should be completely uncontroversial, as intent and behavior do not enter into any account of it. There has been talk in HTTPbis of replacing the term "entity" with "representation", but this hasn't happened yet. We should be prepared to follow suit if it does. Beware - the RFC talks of "instances of that entity" (13.10) and "last modified", as if entities could change. -- I recommend that this class be incorporated into HTTP-in-RDF.
HttpResource (class): what the spec calls "resource": "A network data object or service that can be identified by a URI". How this compares to rdf:Resource, foaf:Document, and awww:InformationResource is probably best left to other ontologies as HTTP/1.1 was designed without any of those things in mind. Whether it's the same as RFC 2396 "resource" is not clear.

I don't think we need a class Representation (unless HTTPbis renames "entity" to "representation"). The meaning can be captured better in relationships. 3.1 has "an entity ... that is subject to content negotiation" - I would interpret "subject to" as meaning "may be chosen by" or "may be the outcome of". As a class, Representation would have to be those entities (a proper subclass) that are, or were, or might reasonably be subject to CN. This is not inherent to the entity, but contextual. To say that E is an entity but not a representation would be to say that no HTTP server generates E as a result of CN. Not sure whether that is at the present time, or for all time, but in either case it is an empirical statement about the population of the world's HTTP servers.

(Can someone explain to me the difference between a "variant" of X and a "representation" of X? 14.14 "a Content-Location for the variant corresponding to the response entity")

Correspondence

We need a way to relate Entities to HttpResources, and RFC 2616 supplies several ways to express what seems to be the same relationship:

corresponds to: 10.2.1 "an entity corresponding to the requested resource", HTTPbis 8.2.1 "an entity corresponding to the requested resource is sent in the response". (Beware 10.3.1 "The requested resource corresponds to any one of a set of representations" (backwards!); 14.14 "the resource corresponding to this particular entity" (also backwards); HTTPbis 8.3.4 "a representation corresponding to the response".)
is associated with: 3.1 "A resource may have one, or more than one, representation(s) associated with it at any given instant"; 3.11 "all versions of all entities associated with a particular resource" (entities can have versions?); 14.14 "where a resource has multiple entities associated with it"
is a representation of: 14.24 "the entity ... is no longer a representation of that resource"; 3.1 "the representation of entities"
is from: 3.11 "entities from the same requested resource"
is the form of: 9.3 "whatever information (in the form of an entity) is identified by ..." (this has been excised from HTTPbis)

My vote is for "corresponds to", especially given that HTTPbis (that means Roy Fielding, Mark Nottingham, et al) uses this.

Meaning of response

A response is a report about a correspondence. It tells you that the entity now corresponds to the resource (the one mentioned as the request-URI of the request), that it has done so for a while, and that it will do so for a while into the future.

Does a 200 response to a GET say anything else about the HttpResource, other than that the entity corresponds to the HttpResource (at the time the response was generated)? The response-headers (as listed in HTTPbis) appear to be irrelevant, as they should be. The entity-headers might be relevant; they are listed here.

We know that all the Content- fields vary from one entity to another, so they say nothing about the HttpResource that corresponds_to doesn't capture. That leaves Expires and Last-modified. One might want to say that these say something about changes in the HttpResource - either that a change has occurred, or that it hasn't. But I think they're nothing more than statements about the corresponds_to relation - they don't say anything about changes to the HttpResource. (I had hoped that entity stability could imply something about HttpResource stability, but I don't think it does. Nothing rules out the possibility that an HttpResource could change in consequential ways with no change in whether an entity corresponded to it. Nor is it ruled out that an entity could correspond to an HttpResource at one time, then suddenly stop corresponding to it without the HttpResource itself changing.)

Note that we can't use Expires: and Last-modified: to encode membership in Tim's "fixed resource" class: there is no way to specify a date infinitely far in the past or future, and in addition "HTTP/1.1 servers SHOULD NOT send Expires dates more than one year in the future."

(By the way, in looking this over I learned something: The entity body is encoded for transmission: 4.3 "transfer-coding is a property of the message, not of the original entity"; 3.7.1 "An entity-body transferred via HTTP messages MUST be represented in the appropriate canonical form prior to its transmission". That is, the same entity might be transferred, at different times, using different transfer-codings.)

Modeling

I'm not sure how to model time. Here's a model that seems plausible...

Correspondence (class): the condition of an entity corresponding to a resource continuously for some period of time
corresponder (func. property): the entity doing the corresponding
correspondee (func. property): the resource to which the entity corresponds
holds_at (property): a time at which the correspondence holds (as from Last-modified: or Date:)
holds_until (property): a time just before which the correspondence holds (as from Expires:)
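
As a rough sketch, the class and properties listed above could be declared in Turtle along the following lines. The ex: namespace is hypothetical, and the domain and range statements are my reading of the list, not something the spec or HTTP-in-RDF supplies. Note that holds_at is deliberately not functional, since a Correspondence may be known to hold at several times.

```turtle
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix owl:  <http://www.w3.org/2002/07/owl#> .
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .
# Hypothetical namespace, for illustration only.
@prefix ex:   <http://example.org/http-semantics#> .

ex:Entity         a owl:Class .
ex:HttpResource   a owl:Class .
ex:Correspondence a owl:Class .

# The entity doing the corresponding (exactly one per Correspondence).
ex:corresponder a owl:ObjectProperty, owl:FunctionalProperty ;
    rdfs:domain ex:Correspondence ;
    rdfs:range  ex:Entity .

# The resource to which the entity corresponds (exactly one per Correspondence).
ex:correspondee a owl:ObjectProperty, owl:FunctionalProperty ;
    rdfs:domain ex:Correspondence ;
    rdfs:range  ex:HttpResource .

# A time at which the correspondence holds (assumed range: xsd:dateTime).
ex:holds_at a owl:DatatypeProperty ;
    rdfs:domain ex:Correspondence ;
    rdfs:range  xsd:dateTime .

# A time just before which the correspondence holds (assumed range: xsd:dateTime).
ex:holds_until a owl:DatatypeProperty ;
    rdfs:domain ex:Correspondence ;
    rdfs:range  xsd:dateTime .
```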

The idea is that a GET/200 exchange tells you that such a Correspondence exists, and it tells you something about its lifetime.

There is an "aboutness" relationship between an HTTP exchange and a Correspondence: the response in the exchange reports that there is a Correspondence between the resource "identified" by the request-URI and the entity occurring in the response. It tells you that the Correspondence held at the time the response was sent, and it may tell you an interval during which it holds (if the Last-modified: and Expires: fields are present in the response).

The continuity condition ensures that a Correspondence holds during a single time interval. You just might not know what it is. But if you know that it holds at t1 and until t2, then you know at least that [t1, t2) is a subinterval of the true interval.

If you learn that c1 has corresponder E and correspondee R, and that c2 does as well, and that the lifetime of c1 overlaps that of c2, then c1 is c2.
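
A worked example of the continuity condition and the identity rule, as a sketch only: the ex: namespace, the names ex:c1, ex:c2, ex:E, ex:R, and all the timestamps are invented for illustration. The use of owl:sameAs to state "c1 is c2" is my choice; the rule itself is not expressible in OWL and would have to be applied by whoever consumes the data.

```turtle
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
# Hypothetical namespace and resource names, for illustration only.
@prefix ex:  <http://example.org/http-semantics#> .

# Learned from a first exchange: known to hold on [10:00, 10:10).
ex:c1 a ex:Correspondence ;
    ex:corresponder ex:E ;
    ex:correspondee ex:R ;
    ex:holds_at     "2008-11-11T10:00:00Z"^^xsd:dateTime ;
    ex:holds_until  "2008-11-11T10:10:00Z"^^xsd:dateTime .

# Learned from a second exchange: known to hold on [10:05, 10:20).
ex:c2 a ex:Correspondence ;
    ex:corresponder ex:E ;
    ex:correspondee ex:R ;
    ex:holds_at     "2008-11-11T10:05:00Z"^^xsd:dateTime ;
    ex:holds_until  "2008-11-11T10:20:00Z"^^xsd:dateTime .

# Same entity, same resource, and the known intervals overlap, so the
# true lifetimes overlap as well and the rule gives:
ex:c1 owl:sameAs ex:c2 .
```

Once c1 and c2 are identified, the single Correspondence is known to hold at 10:00 and until 10:20, so by continuity [10:00, 10:20) is a subinterval of its true lifetime.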

When interpreting response headers, if you don't believe the sender's clock, you should either correct the times in the headers, or refuse to believe them. That is, the time reported and the time at which the entity corresponded to the HttpResource may be different. A server following the protocol might have an incorrect clock but still have internal consistency: 13.2.4 "Note that this calculation is not vulnerable to clock skew, since all of the information comes from the origin server." However this is of little help if you are trying to learn something about the HttpResource itself.

BFO comparison: A Correspondence is sort of like a quality, an aspect of some thing (the resource, or a process in which the resource participates) that endures for some period of time. But the Correspondence doesn't necessarily inhere in the resource itself, as it can start or end without there being any change in the resource at those times.

Request/response exchanges

If you're concerned about recording provenance of information about Correspondences, you might want to record that a particular request/response interaction (recorded using the HTTP-in-RDF ontology, perhaps) led you to believe that the Correspondence existed and had (at least) a certain lifetime.

The request will tell you what URI was used to name the HttpResource.

A strong_etag, provided in the response, could be useful for identifying an entity (in conjunction with the URI).

Most of the other information in a GET/200 request/response pair tells you little about either the resource or the entity.

Exchange (class): a process in which a client sends a Request that is received by a server, which following the HTTP protocol composes a Response which is then received by the client.
Get200Exchange (subclass of Exchange): an Exchange in which the request-method is GET and the response-status is 200.
is_evidence_of (property relating Exchange to Correspondence): the Response (in the Exchange) says that the entity in the response now corresponds_to the resource "identified" by the request-URI (in the Request); also that it has corresponded_to the resource at least since the last-modified time, and will correspond_to it at least until the expires time.

Message: as in HTTP-in-RDF... syntactic entities, not events
Request (subclass of Message): similarly
Response (subclass of Message): similarly
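
The classes and property in this section might be declared as follows. This is a sketch under assumptions: the ex: namespace and the instance names ex:exchange1 and ex:c1 are hypothetical, and in practice Message, Request and Response would presumably be taken from HTTP-in-RDF, not redeclared.

```turtle
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix owl:  <http://www.w3.org/2002/07/owl#> .
# Hypothetical namespace, for illustration only.
@prefix ex:   <http://example.org/http-semantics#> .

# Messages are syntactic entities, not events.
ex:Message  a owl:Class .
ex:Request  a owl:Class ; rdfs:subClassOf ex:Message .
ex:Response a owl:Class ; rdfs:subClassOf ex:Message .

# Exchanges are processes involving a Request and a Response.
ex:Exchange       a owl:Class .
ex:Get200Exchange a owl:Class ; rdfs:subClassOf ex:Exchange .

ex:Correspondence a owl:Class .

# Relates an Exchange to the Correspondence it reports.
ex:is_evidence_of a owl:ObjectProperty ;
    rdfs:domain ex:Exchange ;
    rdfs:range  ex:Correspondence .

# Provenance for one belief: this exchange is why we think c1 exists.
ex:exchange1 a ex:Get200Exchange ;
    ex:is_evidence_of ex:c1 .
ex:c1 a ex:Correspondence .
```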

Other opportunities

Some entity headers claim something about the entity itself. Content-MD5 certainly has this property, and Content-language: says what language is used in the content.

Content-type: is better thought of as prescriptive, not descriptive, as it communicates something not so much about the content itself as about how the sender wants you to interpret the content.

Following a 200 response to a DELETE request, you know that the HttpResource has been deleted (if the server is following the protocol). Similarly, a 201 Created response to a PUT tells you the HttpResource didn't previously exist.

304 Not Modified tells you of a corresponds_to relationship, if you have the previously fetched entity on hand.

301 Moved permanently tells you that the target URI is now another name for the same resource.

302 Found and 307 Temporary Redirect don't tell you much because "residing at a URI" is not the same as being "identified" by a URI. But they do suggest that a subsequent GET/200 on the target URI can be taken to imply a corresponds_to relation between the original resource and the entity retrieved from the target.

(Hmm, maybe we need a "resides_at" property or a three-place "Residency" relation?)

Content-location: in a response seems similar to 307, except that only those entities that correspond_to the target resource AND have matching CN parameters also correspond_to the original resource.

303 See Other has a new description in HTTPbis - that happens to match current LOD (semantic web) practice! "The Location URI indicates a resource that is descriptive of the requested resource" - indicating that a POWDER describedBy relationship can be asserted. Incredible!

But if GET X/302 Y is followed by a GET Y/303 Z (e.g. X=http://purl.org/NET/jar), we don't know that the 303 target Z describes the original resource - it may only describe Y, which is not the same resource as X. We would have to invoke some other assumptions before concluding that.
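
The redirect readings above can be written down as triples. This is only a sketch: the example.org URIs are placeholders, owl:sameAs is one possible way to say "another name for the same resource", and wdrs:describedby is the POWDER-S property as I understand it.

```turtle
@prefix owl:  <http://www.w3.org/2002/07/owl#> .
@prefix wdrs: <http://www.w3.org/2007/05/powder-s#> .

# GET <http://example.org/old> returned 301 with Location <http://example.org/new>:
# the target URI is another name for the same resource.
<http://example.org/old> owl:sameAs <http://example.org/new> .

# GET <http://example.org/thing> returned 303 with Location
# <http://example.org/about-thing>: the target describes the requested resource.
<http://example.org/thing> wdrs:describedby <http://example.org/about-thing> .

# GET <http://example.org/x> returned 302 to <http://example.org/y>, and
# GET <http://example.org/y> returned 303 to <http://example.org/z>.
# Only the second hop licenses a description statement:
<http://example.org/y> wdrs:describedby <http://example.org/z> .
# Nothing is asserted about <http://example.org/x> and <http://example.org/z>;
# that would need further assumptions.
```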

410 Gone doesn't tell you anything about the resource except that it was once "there".

There ought to be a theory of entity substitution derivable from the spec: that is, rules of the form if E corresponds to R, then E' corresponds to R, where E' is E with some headers added, removed, or modified. For example, if E' is E with its Last-modified: header removed, one would expect that E' would also correspond to R. We should look for places where the spec says that a header is optional or may be added or removed by a proxy.

What could it all mean?

The entity seems to have something to do with the resource, but the spec doesn't tell us what - and probably shouldn't, given that it's a protocol spec. Does it "come from" the resource? Not necessarily; we don't know that the resource has a nature that allows entities to come from it. Does it "speak for" the resource? Similarly. Can all of its essential characteristics be conveyed in a message? Unlikely. Is it "essentially information"? Hard to say. Is it a "hypertext node"? Who knows.

Where do entities come from? They're sent by a server (sometimes at least), and the server finds them or creates them. The server might put them together after consulting a variety of sources - the resource, its own configuration (branding) and expertise (such as reformatting or translation), some other thing that the resource interacts with (such as the HTTP request, or the view seen by the webcam). It's very hard to say, so let's not.

Why do correspondences come and go? Well, presumably something happens to cause this. Maybe the resource changes, or maybe something that you, or the resource, interact with changes (such as the codebook you will use to decode the entity). Or maybe the resource knows the population of entity readers has changed, and has adjusted the correspondences to make it happy. Who knows.

Application to "generic resources" ontology

Consider TimBL's "FixedResource" class. It would be nice if there were an analogous subclass of HttpResource that was not restricted to generic resources. We would need that such an HttpFixedResource has only one correspondence, but do we need an additional restriction on the duration of that correspondence? Such as, that the correspondence holds_until t for all t greater than some time at which it holds, i.e. it's immortal. It would be hard to say that a resource that can be deleted is a fixed resource.

Whether fixed resources also need to be infinitely old is not clear. This seems counterintuitive and unnecessary, but ontologically I guess it would be no stranger than having them be immortal.

Similarly we could have a Version (or Stable) subclass of HttpResource. Here it becomes harder to say what this means. Certainly all Correspondences are immortal, but is a stronger condition required, e.g. that they all start at the same time?

Tim's notions of "language invariance" and "content-type invariant" ought to be straightforward, although there are subtleties here too. Does it mean that the Content-Language: header value in every corresponding Entity is the same (or else universally absent)? Or that it denotes the same language in every Entity? (That's not the same as saying that the entity is written using the indicated language, by the way... or is it... need to check the spec.)

Relation to previous discussion of "trace"

In this model, the parameter space is one-dimensional, with time as that dimension. For CN purposes language and content-type are picked up from the entities themselves, and the server can use User-agent (and just about any other information it has, including whim) to choose among entities. I know Tim really wants to treat time the same as other parameters, but it really is different, because it's a parameter that can't be controlled. (You can set the time to be a future time by waiting, but you can't specify a time in the past.)

The trace of a resource is then, for each time t, the set of all entities for which there exists a Correspondence that holds_at t. This makes it the same as Fielding and Taylor's formal resource model.

But as in Tim's generic resource model, there can be distinct HttpResources that both have the same trace. So we cannot say that HttpResources coincide with the Fielding and Taylor formal model. The model is just a model, and doesn't capture these distinctions.

Stuff and nonsense

# RDF capturing outcome of GET http://w3.org/

@prefix xsd: <http://www.w3.org/2001/XMLSchema#>.
@prefix owl: <http://www.w3.org/2002/07/owl#>.
@prefix wdrs: <http://www.w3.org/2007/05/powder-s#>.
@prefix ht: <http://www.w3.org/2001/tag/awwsw/http.owl#>.

# 301: redirect
<http://w3.org/> ht:residesWith <http://www.w3.org/>.
<http://w3.org/> owl:sameAs <http://www.w3.org/>.

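# 200: correspondence between an entity and the resource
<http://w3.org/> a ht:Get200Candidate.
# The Correspondence is a separate blank node; it points back to the
# resource through ht:toResource.
[a ht:Correspondence;
 ht:ofWaRepresentation
  [a ht:Entity;
   ht:hasContentLength 51346;
   # Server-provided etag = "c892-46d0b1c608cc0;89-3f26bd17a2f00"
   ];
 ht:toResource <http://w3.org/>;
 # Two times at which the correspondence held (presumably from Date: and Last-modified:)
 ht:heldAt '2009-06-25T13:00:41Z'^^xsd:dateTime;
 ht:heldAt '2009-06-23T21:59:55Z'^^xsd:dateTime;
 # Presumably from Expires:
 ht:holdsUntil '2009-06-25T13:10:41Z'^^xsd:dateTime;
 ].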