W3C Activities Propagation

Caching and Replication - the Problem

To futher define the propogation, replication and chaching activity area, here is a summary of the problem intended to help fram a theoretical analysis of the protocol.

These notes refined during a discussion with Baruch Awerbuch/MIT-LCS. Thanks for comments from Phill Hallam-Baker.

Model

The net

Let us take the model of a network of internet hosts. Messages may be sent between any two nodes.

Some hosts are permanent, some temporary. One can send an unsolicited message to a permanent host with the assuption that it will be there to receive it. (W3 servers are for example "permanet" normally). Temporary hosts may be available only from time to time. Whilst they may store state information in the protocol for their own purposes over long time periods, it is unwise for another node to require anything of the persistence of that state information.

One can make the assumption that for a message passing delay t(a,b) between nodes a and b, that in general

t(a,b) <= t(a,c) + t(c,b)

t(a,b) can become effectively infinte as parts of the network break.

Documents

Documents are stored originally on permanent nodes. We assume here that for any document there is a permanent node ("origin server") from which it can always be retrieved. The document is identified by a URL which contains the name of the node.

Some documents are virtual documents, in that their state is defined by a function and many vary depending on many things. Some documents change very slowly.

Each document contains links to a number of other documents.

The task

The task is to define a protocol between permanent and temporary nodes to allow copies of documents to be moved around such that when at any node a document is requested, it will be available within a short time with high probability.

The program making the request is known as the client.

When a given server receives many requests in a short time, it can only process them with a certain (statistically distributed) speed, and can queue a limited number. It can start simultaneous processing of requests in which case the number of concurrent requests advresly affects the time for the requet to be completed.

There is no fixed cost function for document access delay and this may vary from request to request. The cost however can be assumed to be only a function of the delay from the request being made by the client to the (first part of the) document becoming avaialbel at the client.

Constraints

The constraints on the problem are as follows.

Inputs to the process

In general, the bulk of the data may follow well defined statistical patterns but the network must be able to cope with wild extremes. For example, netwrok varies ingeneral slowly hour by hour with dirunal and weekly patterns, but sudden outages also occur. Many servers have Zipf-distributed document interest levels but some servers have an unbounded number of virtual documents. Furthurmore, changes in technology and use patterns can be expected in the future which will invalidate too detailed an assuption, so we can conclude that the system must be extremely adaptive.

Roll out

When the system has ben designed, a transition strategy will have to be designed from the current system. Currently two systems are basically in use: the client either directly requests a document from the original server, or it request every document froma given "proxy" node (shared by several clients) which either requests the document itself from the origin server, or which returns a cached copy which it assumes it still valid, or checks in real time is still valid by polling the original server.

Note that some clients already implement a feature, when folowing a link from document a to document b, of telling the server (or proxy) for b both a and b. This allows the server for b to deduce the existence of a, and inprinciple to tell the server for a through some new protocol that the link had been followed. This would allow a to build up a "branching ratio" database, which could possibly be used to predict future usage.


Tim BL 950315