State and Storage: Files, Documents, and Resources

The World Wide Web is a universal information space. Informally, we discuss the Web as if it were frozen in time and identical from all perspectives; we say that "the title of http://www.w3.org/xyz is 'One Fine Day'" despite the fact that the content---including the title---of this resource may change over time, and despite the fact that it may be available in French with a different title.

To specify the protocols that govern the Web, it is essential to realize that in fact, the Web is a sort of mass hallucination shared among all the people and machines distributed around the globe who accept the principles of Web Architecture, much the way businesses and consumers accept the principles of an economy based on paper currency. By and large, we agree that there is one http://www.w3.org/xyz, even though each of us has slightly different experiences of it, much like by and large, people in the U.S. have a shared concept of the value of a dollar, even though in fact each person has a slightly different perspective on what they're willing to trade for one. The large scale effect is the result of each participant following the same principles when they communicate and interact with each other.

When I say "the title of http://history.org/1492 is 'Christopher Columbus goes to America'," all I really know is that my machine sent some packets to history.org and got back some packets with HTTP and HTML syntax that indicate a title of 'Christopher Columbus goes to America' (and I only believe that because I assume the machines didn't malfunction and weren't tampered with). If you follow that reference with your machine, you probably expect to see the same title, even though I can't guarantee it as a fact.

But your expectations are not the sort of "2+2=4" certainty. Nothing prohibits the history.org webmasters from setting up their server to return completely random results in response to requests for /1492. But why would they do that? They publish the /1492 resource to participate in the network effects of the Web, whose value depends on the fact that following a link labelled "Christopher Columbus goes to America" usually rewards the user with a resource that is relevant to that subject.

To model state distribution in the Web mathematically, we take as axiomatic that URIs are unambiguous, i.e. each URI refers to exactly one resource (cf. sameness in Axioms of Web architecture). We leave the issue of resource identity in the informal realm of philosophy; only URIs occur in this formal model.

The relationship between URIs, resources, and content is similar to the relationship between identifiers, variables, and values in a program: it depends on the state of the program. But the Web is a distributed, parallel computation, not a sequential program. Its state is exposed in messages between agents, à la CSP [Hoare78]. In particular, for

represents(m, i, c) read: a message m represents a URI i as literal content c (@@where c is a sequence of bytes paired with an Internet Media Type)
[for Larch translation, see HTTP trait]

For example, in the history.org case above, the 200 OK response from history.org represents http://history.org/1492 as an HTML document.

Note that a message is not just the bytes sent; it's the event of sending the packets. We don't say that a message is "sent twice"; rather, we say that two messages are sent with the same bytes. (This is subtly different from the definition of message in the HTTP 1.1 spec. @@hmm... for this reason, use "event" instead of message? I don't think so...)

There is a strict partial order on messages:

if m₁ happens before m₂ and m₂ happens before m₃, then m₁ happens before m₃
if m₁ happens before m₂, then it can't be that m₂ happens before m₁
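
As a rough illustration of what these two conditions demand, here is a minimal Python sketch that checks them on a finite set of (earlier, later) pairs. The message names m1, m2, m3 and the encoding of the order as a set of pairs are assumptions made for the example; they are not part of the model.

```python
from itertools import product

def is_strict_partial_order(before):
    """Check that a set of (earlier, later) pairs is asymmetric and transitive."""
    # Asymmetry: if a happens before b, then b must not happen before a.
    # This also rules out a message happening before itself.
    for a, b in before:
        if (b, a) in before:
            return False
    # Transitivity: a before b and b before c implies a before c.
    for (a, b), (c, d) in product(before, repeat=2):
        if b == c and (a, d) not in before:
            return False
    return True

# Hypothetical messages, for illustration only.
ordered = {("m1", "m2"), ("m2", "m3"), ("m1", "m3")}
missing_link = {("m1", "m2"), ("m2", "m3")}   # lacks (m1, m3)
cyclic = {("m1", "m2"), ("m2", "m1")}         # violates asymmetry
```

Note that the check does not require every pair of messages to be ordered; two messages may be unrelated, which is what makes the order partial.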

Let's start with a simple, constrained case of the represents relationship, and then examine the complexities by relaxing the constraints.

@@@stuff below here may be inconsistent

Fixed Content Resources

In one of the simplest cases of a resource, there is only one piece of associated content ever shown for the resource. Definition: a resource r is a fixed content resource iff there is exactly one piece of content c such that for all m, m shows c for r.

For example, if a MIME message is sent with a body part such as:

Content-Id: 2903874923802938409283@w3.org
Content-Type: text/plain

Four score and seven years ago...

then that message shows ("Four score and seven years ago...", text/plain) for the URI cid:2903874923802938409283@w3.org. And the MIME specification guarantees that even if that body part is sent enclosed in another message, no message will show any content other than ("Four score and seven years ago...", text/plain) for the URI cid:2903874923802938409283@w3.org.
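
A minimal Python sketch of the definition, under the assumption that we can only ever inspect a finite sample of messages. The triple encoding (message, URI, content) and the message names m1, m2, m3 are hypothetical. Because the definition quantifies over all messages, a sample can refute the fixed-content property but never prove it, so the function is named accordingly.

```python
def consistent_with_fixed_content(uri, observations):
    """Return True if the observed messages show at most one content for uri.

    observations: iterable of (message, uri, content) triples, where
    content is a (body, media_type) pair.
    """
    contents = {c for (_m, i, c) in observations if i == uri}
    return len(contents) <= 1

CID = "cid:2903874923802938409283@w3.org"
speech = ("Four score and seven years ago...", "text/plain")

observations = [
    ("m1", CID, speech),
    ("m2", CID, speech),  # the same body part enclosed in another message
]

# A hypothetical third message showing different content for the same URI.
with_conflict = observations + [("m3", CID, ("something else", "text/plain"))]
```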

[@@use the shows relation to specify the "NOTE ON THE SEMANTICS OF CONTENT-ID IN MULTIPART/ALTERNATIVE"]

Modelling Protocol Violations

While it is dangerous to assume that a fixed content resource has only one URI, the converse is true for cid: URIs. The principle of unambiguous URIs says that each cid: URI refers to one resource, and the cid: URI scheme specifies that it is a fixed content resource. If one MIME message is sent with a body part of:

Content-Id: not-much-entropy@lazy.net

abc

and another MIME message is sent with a body part of:

Content-Id: not-much-entropy@lazy.net

def

then one of the senders has violated the MIME specification and hence the cid: URI scheme specification.
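
The check can be sketched with Python's standard email package. This is only an illustration: the function name cid_conflicts is made up here, and each body part is parsed on its own as a small message rather than as part of a full multipart structure.

```python
from collections import defaultdict
from email import message_from_string

def cid_conflicts(raw_parts):
    """Map each Content-Id seen with more than one distinct content
    to the set of (media_type, body) pairs seen for it."""
    seen = defaultdict(set)
    for raw in raw_parts:
        part = message_from_string(raw)
        cid = part["Content-Id"]  # header lookup is case-insensitive
        if cid is not None:
            # get_content_type() falls back to text/plain when no
            # Content-Type header is present.
            seen[cid].add((part.get_content_type(), part.get_payload()))
    return {cid: found for cid, found in seen.items() if len(found) > 1}

part_a = "Content-Id: not-much-entropy@lazy.net\n\nabc\n"
part_b = "Content-Id: not-much-entropy@lazy.net\n\ndef\n"

conflicts = cid_conflicts([part_a, part_b])
```

The result tells us that some sender broke the rules, but not which one; the observations alone give no way to decide between them.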

This is just one instance of the general rule: consistency in this mathematical model of the Web assumes conformance to the relevant specifications. (@@or does it? does forgery/corruption show up in the model somehow?)

Quasi-Static Resources

@@ section 10.3.5 304 Not Modified

req1: C->S: GET /index.html HTTP/1.1
Host: h
resp1: S->C: 200 OK
ETag: "tag1"

content1
req2: C->S: GET /index.html HTTP/1.1
Host: h
If-None-Match: "tag1"
resp2: S->C: 304 Not Modified

The client claims that resp1 happens before req2, i.e. they are causally ordered, by copying the validator "tag1" from resp1 to req2.

asserts forall m1,m2: Message, c: Content, i: URI

if shows(m1, c, i) and etag(m2) = etag(m1) then shows(m2, c, i)

shows(resp1, content1, "http://h/index.html")

...@@

shows(resp2, content1, "http://h/index.html")
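
One way to read the step from the first shows fact to the second is as a closure under the entity-tag rule. The Python sketch below applies the rule as stated. It assumes that etag(resp2) is "tag1", which the trace does not show explicitly; the function name and the dictionary encoding of etag are also assumptions made for the example.

```python
def propagate_by_etag(shows, etags):
    """Close a set of shows facts under the rule
    shows(m1, c, i) and etag(m2) = etag(m1)  =>  shows(m2, c, i).

    shows: set of (message, content, uri) triples.
    etags: dict mapping a message to its entity tag.
    """
    derived = set(shows)
    for (m1, c, i) in shows:
        tag1 = etags.get(m1)
        if tag1 is None:
            continue  # no validator on m1, so the rule does not apply
        for m2, tag2 in etags.items():
            if tag2 == tag1:
                derived.add((m2, c, i))
    return derived

URI = "http://h/index.html"
shows = {("resp1", "content1", URI)}

# Assumption: the 304 response carries the same entity tag as resp1.
etags = {"resp1": '"tag1"', "resp2": '"tag1"'}

derived = propagate_by_etag(shows, etags)
```

A single pass is enough here because equality of entity tags is an equivalence: anything derivable from a derived fact is already derivable from the original one.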

This is not falsifiable/observable:

If any of the entity tags match the entity tag of the entity that would have been returned in the response to a similar GET request (without the If-None-Match header) on that resource, or if "*" is given and any current entity exists for that resource, then the server MUST NOT perform the requested method, unless required to do so because the resource's modification date fails to match that supplied in an If-Modified-Since header field in the request.

Dangerous phrases #1: the content of a resource

@@can only say "the content of a fixed content resource"; "the content of a resource" is an ill-defined definite description[Russel12]

Fixed Content in HTTP

The cid: URI scheme is just one mechanism that indicates that a resource is a fixed content resource.

@@section 14.44 Vary of the HTTP spec; if we could say:

Vary: none
Expires: never

in HTTP, we could use HTTP to declare fixed content resources. But there's no syntax for that. We can exploit the fact that Content-MD5 headers do not occur in GET requests and write:

Vary: Content-MD5

but there's no work-around for the lack of Expires: never. So the HTTP 1.1 protocol can only declare quasi-static resources@@.

Aliases

shows(m, c, i1) /\ content-location(m, i2) => shows(m, c, i2)

The Content-Location value is not a replacement for the original requested URI; it is only a statement of the location of the resource corresponding to this particular entity at the time of the request. Future requests MAY use the Content-Location URI if the desire is to identify the source of that particular entity.
-- HTTP/1.1, section 14.14 Content-Location
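
The alias rule above can be sketched in Python as follows. The URIs and message name are hypothetical, and content-location is modelled as a plain dictionary from message to URI, which is an assumption of the example rather than anything HTTP prescribes.

```python
def add_aliases(shows, content_location):
    """Apply the rule
    shows(m, c, i1) and content-location(m, i2)  =>  shows(m, c, i2).

    shows: set of (message, content, uri) triples.
    content_location: dict mapping a message to its Content-Location URI.
    """
    derived = set(shows)
    for (m, c, _i1) in shows:
        i2 = content_location.get(m)
        if i2 is not None:
            derived.add((m, c, i2))
    return derived

# Hypothetical response to a request for a negotiated resource.
shows = {("resp", "content-en", "http://h/index.html")}
content_location = {"resp": "http://h/index.en.html"}

aliased = add_aliases(shows, content_location)
```

The rule only adds a fact about the same message m; in keeping with the quoted text, it says nothing about what later messages will show for either URI.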

Redirection

@@redirection: redirect(m1, m2, m3, m4) ::= http(m1, m2) and http(m3, m4) and status(m2) in {301, 302} and address(m3) = location(m2) and m2 happens before m3. If redirect(m1, m2, m3, m4) and shows(m4, c, address(m3)) then shows(m4, c, address(m1)).

Stable publishing

@@ fixed content plus assurance of availability

@@hm... compound documents

Intrinsic, Descriptive, and Extrinsic properties of Resources (@@for lack of a better heading)

@@"the service, which you can see is on port 80" versus "the service, which was started in 1996" versus "the service that was started in 1996"; Content-Length is which-you-can-see-like; Content-Location is which-like, Content-Type is that-like.

@@Sense/Reference

Ian suggests relating this to the morning star/evening star issue:

The statement "1999-07-22 = 1999-W294" is, in other words, completely analogous to the statements

117 + 136 = 253.
The morning star is identical to the evening star.
Mark Twain is Samuel Clemens.
Bill is Debbie's father.

cited in the Frege article in the Stanford Encyclopedia of Philosophy as examples of statements requiring that we make some distinction between sense and reference in order to make sense of them and account for their cognitive significance.
-- C M Sperberg-McQueen
Thu, 22 Jul 1999 19:52:32 -0500

Acknowledgements and @@Fodder

World Wide Web Architecture Paul Burchard work in progress

earlier larch stuff: Jan 1995: formalism, Jan 1996 webarch

A Formalism for Internet Information References

The Web Object Model and Type System

Formalizing Web Technology Jan 1994 presentation

@@discussion of who gets to say/define what a URI refers to.

@@non-ambiguity principle not guaranteed by DNS, too often violated by information providers: "cool URIs don't change"

@@formally specify HTTP caching: ETag, last-modified, max-age

@@formally specify 3rd party auth server protocol with HTTP digest auth (check with dsr)

@@WCA Terms

compare with, cite WCA terms spec

@@U-REST

have a look at U-REST spec

@@ RDF toolbox

@@ consider mapping to POSIX filesystem API (HTTPfs, aka; WEBDAV)

Borrowing from database terminology, HTTPFS provides an isolation level of "Repeatable Read" for concurrent file transactions.

citation for "Repeatable Read" terminology?

A directory request is returned as plain text, in a format similar to a 'ls -l' listing.

why not some HTML/XML dialect? e.g. <dir><li><a href="name">name</a></li></dir>

These authentication/authorization policies are the responsibility of a web server's administrator; MCHFS need not be aware of them.

argh! write authorization is the biggest issue! skipping over it makes the whole thing just a toy.

References

[Hoare78]: Hoare C.A.R., "Communicating Sequential Processes," Communications of the ACM, vol. 21, no. 8, Aug. 1978, pp. 666-677

Dan Connolly
$Revision: 1.32 $
$Date: 1999/09/17 21:28:26 $ $Author: connolly $