Caching and Replication - the Problem

To further define the propagation, replication and caching activity area, here is a summary of the problem, intended to help frame a theoretical analysis of the protocol.

These notes were refined during a discussion with Baruch Awerbuch/MIT-LCS. Thanks for comments from Phill Hallam-Baker.

Model

The net

Let us take the model of a network of internet hosts. Messages may be sent between any two nodes.

Some hosts are permanent, some temporary. One can send an unsolicited message to a permanent host with the assumption that it will be there to receive it. (W3 servers, for example, are normally "permanent".) Temporary hosts may be available only from time to time. Whilst they may store state information in the protocol for their own purposes over long time periods, it is unwise for another node to rely on the persistence of that state information.

One can assume that the message passing delay t(a,b) between nodes a and b satisfies, in general,

t(a,b) <= t(a,c) + t(c,b)

t(a,b) can become effectively infinite as parts of the network break.
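
As an illustration only, the delay model can be written down as a table of pairwise delays and checked against the inequality. The node names and delay values below are made up for the example, and math.inf stands in for an "effectively infinite" delay; nothing here is part of the protocol itself.

```python
import math
from itertools import permutations

def violations(delay, nodes):
    """Return the triples (a, b, c) for which t(a,b) > t(a,c) + t(c,b)."""
    bad = []
    for a, b, c in permutations(nodes, 3):
        if delay[(a, b)] > delay[(a, c)] + delay[(c, b)]:
            bad.append((a, b, c))
    return bad

# Hypothetical delays in seconds between three nodes.
nodes = ["a", "b", "c"]
delay = {
    ("a", "b"): 0.05, ("b", "a"): 0.05,
    ("a", "c"): 0.03, ("c", "a"): 0.03,
    ("c", "b"): 0.04, ("b", "c"): 0.04,
}
print(violations(delay, nodes))   # the inequality holds for this table

# A break between a and b makes t(a,b) effectively infinite.
broken = dict(delay)
broken[("a", "b")] = broken[("b", "a")] = math.inf
print(violations(broken, nodes))  # lists the triples that now fail
```

The second table shows why the inequality is stated only "in general": a measured delay table can violate it while part of the network is broken.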

Documents

Documents are stored originally on permanent nodes. We assume here that for any document there is a permanent node ("origin server") from which it can always be retrieved. The document is identified by a URL which contains the name of the node.

Some documents are virtual documents, in that their state is defined by a function and may vary depending on many things. Some documents change very slowly.

Each document contains links to a number of other documents.

The task

The task is to define a protocol between permanent and temporary nodes to allow copies of documents to be moved around such that when at any node a document is requested, it will be available within a short time with high probability.

The program making the request is known as the client.

When a given server receives many requests in a short time, it can only process them at a certain (statistically distributed) speed, and can queue only a limited number. It can start processing requests simultaneously, in which case the number of concurrent requests adversely affects the time for each request to be completed.
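
A very rough sketch of such a server follows. It assumes, purely for illustration, that the server shares its capacity equally among concurrent requests (so the slowdown is linear) and that it refuses requests beyond a fixed limit; the class name, the 0.5s base time and the limit of 50 are assumptions, not measurements.

```python
from dataclasses import dataclass

@dataclass
class Server:
    base_time: float = 0.5   # assumed seconds to serve one request alone
    max_queue: int = 50      # assumed number of requests accepted at once
    active: int = 0          # requests currently being processed

    def accept(self) -> bool:
        """Take on a request, or refuse it when the limit is reached."""
        if self.active >= self.max_queue:
            return False
        self.active += 1
        return True

    def finish(self) -> None:
        """Mark one request as completed."""
        if self.active > 0:
            self.active -= 1

    def expected_time(self) -> float:
        """Assumed equal sharing: n concurrent requests each take n times longer."""
        return self.base_time * max(self.active, 1)
```

A real server's slowdown need not be linear; the point is only that completion time grows with the number of concurrent requests and that some requests are turned away.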

There is no fixed cost function for document access delay and this may vary from request to request. The cost however can be assumed to be only a function of the delay from the request being made by the client to the (first part of the) document becoming available at the client.

Constraints

The constraints on the problem are as follows.

There is a limit on the bandwidth between any two nodes of the network, and congestion on any part of the net used between a and b tends to increase t(a,b).
There is only limited knowledge at any one node of the topology and delay function t(a,b) for the net. This information is typically gained at each node by experience, and changes on a timescale of the order of hours.
The client has a choice of using the new protocol or not. The client always has the choice of the simple protocol of directly accessing the original server, so the new protocol must provide performance which is superior to that.
Typical figures are: message round trip times of the order of 100ms; document processing on the server of the order of 0.1-1s; maximum number of requests per second at a server of the order of 10-100; number of links in a document 0-10-1000. Typically servers may have 100-10000 documents.
Document popularity seems from studies to be roughly Zipf distributed, with the kth most popular document having a relative popularity of 1/k, with k in 1..N.
There is a need for some nodes to be notified (not necessarily immediately) of every request for a document for which they are the original server, either individually or in summary, for reasons of analysis and accounting.
Permanent and temporary nodes may have state storage for protocol related information (such as a directory of where document copies exist) but this space is limited to the order of 1MB on temporary and 100MB - 1GB on permanent nodes.
We can assume that clients are well behaved and do abide by the rules of the protocol. (For comparison, we assume that TCP clients obey the "slow start" of TCP for the common good even though it is against their individual marginal interest.)
Often, interest in a given document will change slowly, possibly with diurnal or weekly modulation. However, there can be pathological cases which produce sudden interest in a document (such as the wide circulation of its URL, for example on a chat channel or television).
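
To get a feel for the Zipf constraint above: if the kth most popular of N documents has relative popularity 1/k, the share of requests that go to the m most popular documents is H(m)/H(N), where H(n) = 1 + 1/2 + ... + 1/n. The sketch below computes this share; the function names are illustrative, and it assumes an exact 1/k law, which the studies only roughly support.

```python
def harmonic(n):
    """H(n) = 1 + 1/2 + ... + 1/n."""
    return sum(1.0 / k for k in range(1, n + 1))

def hit_fraction(m, n):
    """Share of requests going to the m most popular of n documents,
    assuming popularity is exactly proportional to 1/k."""
    if n < 1 or not 0 <= m <= n:
        raise ValueError("need n >= 1 and 0 <= m <= n")
    return harmonic(m) / harmonic(n)

# A server with 10000 documents, the upper end of the typical range above.
for m in (10, 100, 1000):
    print(m, round(hit_fraction(m, 10000), 3))
```

Under that assumption, holding copies of the 10 most popular documents out of 10000 covers roughly 30% of requests, and the top 100 a little over half. The share grows only logarithmically with m, so the long tail of rarely requested documents remains significant.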

Inputs to the process

The topology of the net
The topology of the web of links
Possibly, the structure of the URL. The URL contains a node name and an opaque string. In practice, there is locality of reference within node names, and even within directories of the hierarchical part of the "opaque" string in many cases. However, to rely on this will favor certain systems which use a particularly constrained hypertext topology. If the structure of the URL can be used to very great improvement in the performance of the system, this may be interesting, but solutions which treat the URL as opaque would be preferred.
Dynamic network load (measured/estimated) as an input to the system. (-phb)

In general, the bulk of the data may follow well defined statistical patterns but the network must be able to cope with wild extremes. For example, the network in general varies slowly hour by hour with diurnal and weekly patterns, but sudden outages also occur. Many servers have Zipf-distributed document interest levels but some servers have an unbounded number of virtual documents. Furthermore, changes in technology and use patterns can be expected in the future which will invalidate too detailed an assumption, so we can conclude that the system must be extremely adaptive.
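
One simple way a node might keep an adaptive estimate of t(a,b) from its own experience is an exponentially weighted moving average, the same kind of smoothing TCP applies to round-trip samples. This is only a sketch of one possible technique, not a proposal: the class name and the weight alpha are assumptions.

```python
class DelayEstimate:
    """Smoothed estimate of the delay to one remote node."""

    def __init__(self, alpha=0.125):
        self.alpha = alpha   # weight given to each new sample (assumed value)
        self.value = None    # no estimate until the first observation

    def observe(self, sample):
        """Fold one measured delay (in seconds) into the estimate."""
        if self.value is None:
            self.value = sample
        else:
            self.value = (1 - self.alpha) * self.value + self.alpha * sample
        return self.value

# Slow drift is tracked gradually; a sudden jump shifts the estimate
# only partly, so an outage needs separate handling (e.g. a timeout).
est = DelayEstimate()
for sample in (0.10, 0.11, 0.10, 0.90):
    est.observe(sample)
print(est.value)
```

A small alpha follows the slow hourly and diurnal variation well but reacts sluggishly to sudden changes, which is exactly the tension between typical behaviour and wild extremes described above.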

Roll out

When the system has been designed, a transition strategy from the current system will have to be designed. Currently two systems are basically in use: the client either requests a document directly from the original server, or it requests every document from a given "proxy" node (shared by several clients). The proxy either requests the document itself from the origin server, or returns a cached copy which it assumes is still valid, or returns a cached copy which it has checked in real time is still valid by polling the original server.
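
The three proxy behaviours just described can be sketched as follows. This is a simplified model rather than a description of any particular proxy: the fetch and is_current callables stand in for whatever the proxy uses to talk to the origin server, and the fixed time-to-live used to decide when a copy is "assumed valid" is an assumption of the example.

```python
import time

class ProxyCache:
    def __init__(self, fetch, is_current, ttl=3600.0, clock=time.monotonic):
        self.fetch = fetch            # fetch(url) -> (body, version)
        self.is_current = is_current  # is_current(url, version) -> bool
        self.ttl = ttl                # seconds a copy is assumed valid
        self.clock = clock            # injectable clock, useful for testing
        self.store = {}               # url -> (body, version, time stored)

    def get(self, url):
        """Return (body, how), where how says which path served the request."""
        entry = self.store.get(url)
        if entry is not None:
            body, version, stored = entry
            if self.clock() - stored < self.ttl:
                return body, "cached"        # copy assumed still valid
            if self.is_current(url, version):
                # Copy confirmed valid by polling the origin server.
                self.store[url] = (body, version, self.clock())
                return body, "revalidated"
        # No usable copy: request the document from the origin server.
        body, version = self.fetch(url)
        self.store[url] = (body, version, self.clock())
        return body, "origin"
```

Only the "cached" path avoids contacting the origin server altogether; the "revalidated" path still costs a round trip but not a full document transfer.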

Note that some clients already implement a feature, when following a link from document a to document b, of telling the server (or proxy) for b both a and b. This allows the server for b to deduce the existence of a, and in principle to tell the server for a, through some new protocol, that the link had been followed. This would allow a to build up a "branching ratio" database, which could possibly be used to predict future usage.
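
A minimal sketch of what such a database might hold is given below. It assumes that the branching ratio of a link from a to b means the fraction of requests for a that were followed by a request for b; that definition, and all the names used, are assumptions made for the example.

```python
from collections import Counter, defaultdict

class BranchingRatios:
    def __init__(self):
        self.views = Counter()               # document -> times served
        self.follows = defaultdict(Counter)  # source -> target -> times followed

    def record_view(self, doc):
        """The server for doc served one request for it."""
        self.views[doc] += 1

    def record_follow(self, source, target):
        """The server for source was told a link to target was followed."""
        self.follows[source][target] += 1

    def ratio(self, source, target):
        """Fraction of views of source that led on to target."""
        views = self.views[source]
        return self.follows[source][target] / views if views else 0.0

    def likely_next(self, source, n=3):
        """The n targets most often followed from source."""
        return [target for target, _ in self.follows[source].most_common(n)]

db = BranchingRatios()
for _ in range(4):
    db.record_view("a")
db.record_follow("a", "b")
db.record_follow("a", "b")
db.record_follow("a", "c")
print(db.ratio("a", "b"), db.likely_next("a", 1))
```

A server could use likely_next to decide which linked documents are worth replicating alongside a, though whether that helps depends on how stable the ratios turn out to be.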

Tim BL 950315