A Formalism for Internet Information References

Daniel W. Connolly
$Id: formalism.html,v 1.6 1995/02/08 08:19:38 connolly Exp $

Abstract

This is a mathematical model of computation for the resolution of references between typical information objects in distributed information applications in the internet community; specifically, the model covers the URI concept as used in HTTP and the World Wide Web, and the external body mechanism from MIME.

It provides a foundation for a formal definition of the semantics of so-called meta-information in HTTP and URCs, so that distributed computing issues such as reliability, scalability, and security can be explored.

See Resource Discovery and Reliable Links for a discussion of terminology, related issues, and related discussion resources.

Introduction

From Integration of Internet Information Resources (iiir), Mon Apr 18 22:00:54 CDT 1994:

The Integration of Internet Information Resources Working Group (IIIR) is chartered to facilitate interoperability between Internet Information Services, and to develop, specify, and align protocols designed to integrate the plethora of Internet information services (WAIS, ARCHIE, Prospero, etc.) into a single ``virtually unified information service'' (VUIS).

The body of internet information resources is available through a number of widely deployed technologies (FTP, Gopher, HTTP, WAIS), and there are several successfully deployed applications that combine these technologies to provide information consumers with a consistent model of information regardless of the underlying technology.

But this consistent model breaks down due to a variety of faults, and the user is often left to wonder where the heart of the problem lies. With a comprehensive model of computation, it will at least be possible to define the correct behaviour and a set of fault detection and fault tolerance mechanisms.

On the other hand, this user model has enabled a much larger audience to access internet information resources. The result is a noticeable increase in network traffic. The client-server model where all N information clients make round trips to all M information servers, creating traffic on the order of NxM, is slowly giving way to resource migration and load-balancing techniques (e.g. caching and mirroring). But these techniques are being deployed in an ad-hoc fashion, and it is not clear that, for example, proxy servers do not introduce complications to the underlying protocols.

Security, privacy, and intellectual property issues are only beginning to be addressed. (For example, proxy servers completely punt on the issue of caching access-controlled documents.) Ad-hoc techniques are not an acceptable way to address these issues.

The technology to support the growing base of internet information resources will only get more sophisticated as we attack the problem of large-scale data reduction (resource discovery and navigation) and as we employ the machine more and more to augment learning and the maintenance of information. Formal techniques are necessary to reduce the complexity of such technology.

Foundations

This formalism is based on the first-order, many-sorted logic of Larch, in the hope that it can be integrated into the development of software that implements the formalism.

The formalism comprises the following Larch traits:

Reliable Caching and Mirroring

An Example Scenario

This section is somewhat out of date

As an example, consider successive accesses of http://S/path via an HTTP proxy P:

Client C1 sends a request req1 for r=http://S/path to P.

P contacts S, makes a GET request for /path, and receives a response resp1:

	Date: Mon Dec 12 19:30:39 CST 1994
	Last-Modified: Fri Dec 9 19:30:39 CST 1994
	Expires: Fri Dec 16 19:30:39 CST 1994
	Content-Type: text/plain
	
	blah blah blah...

Let

e0 = (text/plain, "blah blah blah...")
t0 = Mon Dec 12 19:30:39 CST 1994
t1 = Fri Dec 9 19:30:39 CST 1994
t2 = Fri Dec 16 19:30:39 CST 1994

Then at this point, the proxy knows:

	e0 = HTTPGet(req1, t0), i.e.
		e0 in Represent(r, t0)
		and e0 minimizes AcceptPenalty(req1, Represent(r, t0))
			    over Represent(r, t0)
	Last-Modified(r, t0) = t1
	Expires(r, t0) = t2

Since there are no URI: headers in the response, the proxy also knows:

	Represent(r, t0) = {e0}
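
The selection that HTTPGet performs in the facts above can be illustrated with a small Python sketch. The `Entity` class, the per-entity `accept_penalty` function, and the dictionary of media-type preferences are hypothetical names introduced for this illustration only; they are not definitions taken from the formalism, and the penalty formula is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    """A (content type, body) pair, like e0 in the scenario."""
    content_type: str
    body: str


def accept_penalty(accept, entity):
    """Hypothetical penalty: 1 minus the request's preference for the type.

    `accept` maps media types to preferences in [0, 1]; unknown types
    get preference 0 and therefore the maximum penalty.
    """
    return 1.0 - accept.get(entity.content_type, 0.0)


def http_get(accept, representations):
    """Return an element of the set that minimizes the penalty."""
    return min(representations, key=lambda e: accept_penalty(accept, e))


# With Represent(r, t0) = {e0}, the only possible answer is e0.
e0 = Entity("text/plain", "blah blah blah...")
represent_r_t0 = {e0}
req1_accept = {"text/plain": 1.0, "text/html": 0.8}  # assumed preferences
chosen = http_get(req1_accept, represent_r_t0)
```

When the set of representations is a singleton, as here, the minimization is trivial; the sketch only becomes interesting when a resource has several representations.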

The proxy passes resp1, containing e0, on to the client C1. It updates its cache so that Pcache[r] = resp1.
Client C2 makes a request req2 via the proxy P. req2 has the same URI as req1.
P examines its cache and finds Pcache[r] = resp1.
- If the proxy is willing to make the heuristic assumption that its clock is in sync with S's clock, then it can test whether the current local time t3 satisfies t0 <= t3 <= t2. If this is the case, then by the definition of Expires, the proxy can conclude that Represent(r, t3) = Represent(r, t0).
  Since it knows that e0 is the only element of Represent(r, t0) and hence of Represent(r, t3), it can conclude that e0 = HTTPGet(req2, t3), and it can return resp1 to C2.
- If the t3 <= t2 test fails, or if the proxy wants to be strict about distributed notions of time, it can issue a "conditional get," using t0 in the request header:
```
	 If-Modified-Since: Mon Dec 12 19:30:39 CST 1994
	 
```
  If the document has not been modified, the server S will indicate this via resp2 at t4:
```
	 304 Not Modified
	 Date: Mon Dec 12 15:30:39 CST 1994
	 
```
  The proxy can conclude that Expires(r, t4) = Expires(r, t0), and proceed as above.
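
The two branches above can be sketched as a single decision procedure. This is a minimal Python sketch under stated assumptions, not an implementation of any real proxy: `Response`, `serve`, and the `conditional_get` callback are hypothetical names, the callback is assumed to return `None` for a 304 Not Modified reply, and the handling of a modified document (replacing the cache entry) is an assumption that goes beyond what the scenario describes.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Response:
    date: datetime     # t0: the Date header of the cached response
    expires: datetime  # t2: the Expires header
    body: str          # the entity, e0 in the scenario


def serve(cache, uri, now, conditional_get):
    """Answer a request for `uri` at local time `now` (t3 in the text).

    `conditional_get(uri, since)` is a hypothetical callback standing in
    for a GET with If-Modified-Since; it returns None on 304 Not Modified
    and a new Response otherwise.
    """
    cached = cache[uri]  # a cache miss is outside the scope of this sketch
    # Heuristic branch: assumes the proxy's clock is in sync with S's clock.
    if cached.date <= now <= cached.expires:
        return cached
    # Strict branch: revalidate with the origin server, using t0.
    fresh = conditional_get(uri, cached.date)
    if fresh is None:
        return cached  # 304 Not Modified: the cached entity is still valid
    cache[uri] = fresh  # assumption: a modified document replaces the entry
    return fresh


t0 = datetime(1994, 12, 12, 19, 30, 39)
t2 = datetime(1994, 12, 16, 19, 30, 39)
resp1 = Response(t0, t2, "blah blah blah...")
pcache = {"http://S/path": resp1}
```

A stricter proxy could skip the first branch entirely and always revalidate, trading extra round trips for independence from clock synchronization.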

Future Discussion

Other attributes of references

There are certain attributes common to many information resources. For the resources to which some of these attributes apply, we should develop models and mechanisms that make it possible to compute the attributes reliably:

<attribute> (uses)
Author (identification, navigation, relevance, reply)
Date of publication (identification, navigation, relevance)
MD5 signature (identification)
Message-ID, Content-ID (identification)
Maintainer (problem reporting, replying, fallback for annotation)
Abstract (indexing, relevance)
Date of last revision (identification, navigation, relevance)
Expiration (resource migration, traffic optimization)
Translations (to various formats/languages)
Versions
Copies (resource migration, traffic optimization)
Parent, Children (navigation, relevance)
Previous, Next (navigation, relevance)
TOC, Index (navigation)
Back-links (navigation, relevance, maintenance)