Position Paper.

Chris Roadknight.

roadknic@drake.bt.co.uk

If Web traffic characterisation is to be achieved in a meaningful, descriptive way, the elaboration of a user model is essential. If 'typical' usage patterns could be described, then file access patterns could be predicted. The work carried out by myself and others in the British Telecom multimedia systems research group is focused upon cache performance/behavior modeling, so we use transaction data from WWW caches [1,2]. We use statistics from several caches at different points in the WWW request chain, including ISP caches, academic caches and higher-level caches that serve other caches. Aspects of Web characterisation that are of particular interest to us include:

  1. Understanding the nature of file popularity and its impact on cache performance.

  2. The degree of long range dependence and periodicity in request data and their effects.

  3. The implementation and assessment of automated request algorithms, e.g. SURGE [3].

  4. File type and client type clustering.

In this position paper I shall discuss points 1 and 2.

File Popularity.

The popularity of files is of particular importance to Web cache performance, as the benefits offered by these caches are closely linked to the frequency of requests for some files. Popularity curves (ranking of files vs. number of requests) can be plotted for file requests at several caches. Further analysis of curves fitted to these plots yields several propositions:

  1. The popularity curves of different caches can be fitted relatively accurately by a Zipf law curve [4,5], but to achieve optimum fits, significantly different exponents for the Zipf equation must be used.

  2. These exponents are non-stationary.

  3. These exponents have possible implications for the hit rates achieved.

  4. Popularity curves of individual users point to 'typical' usage patterns that are reflected in caches serving a wide community [2].
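
As an illustration of how such an exponent might be obtained, the following is a minimal sketch, not the fitting procedure used in [1,2]. It assumes the Zipf form in which the number of requests for the file of rank r is proportional to r^(-alpha), and estimates alpha by ordinary least squares on the log-log popularity curve. The function name fit_zipf_exponent and the input format (one URL string per logged request) are assumptions made for the example.

```python
import math
from collections import Counter


def fit_zipf_exponent(requests):
    """Estimate the Zipf exponent alpha from a sequence of requested URLs.

    Assumes count(rank) ~ C * rank**(-alpha) and fits a straight line to
    log(count) against log(rank) by ordinary least squares.
    """
    # Request counts per file, most popular first (rank 1 = most requested).
    counts = sorted(Counter(requests).values(), reverse=True)
    if len(counts) < 2:
        raise ValueError("need at least two distinct files to fit a curve")

    xs = [math.log(rank) for rank in range(1, len(counts) + 1)]
    ys = [math.log(count) for count in counts]

    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))

    # The fitted slope is -alpha on the log-log plot.
    return -sxy / sxx
```

Applying the function to separate time windows of one cache log, or to logs from different caches, would give one exponent per window or per cache, which is one way to examine the differences and non-stationarity noted above. A least-squares fit on log-log axes weights the many rarely requested files heavily, so the estimate should be treated as indicative only.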

Long Range Dependence (LRD) and Periodicity.

The actual URLs accessed contain the most information but are difficult to describe quantitatively without content-based evaluation. Currently the easiest and most useful quantity is the hit rate achieved by a client over time. It is fundamental to this report that patterns in hit rate are seen to be related to patterns in the request behavior of clients.

Cache log files from a number of Web caches were used as the source of hit rate data. These include caches used by ISPs, universities and the high-level NLANR caches. Several statistical approaches can be used to characterise LRD and periodicity, including variance analysis, moving averages, frequency domain analysis and cross correlation. The following observations can be made that give an insight into request dynamics:

  1. Inter- and intra-cache variation is partially deterministic in origin [1].

  2. Time domain periodicity was detected in all caches examined.

  3. This periodicity served to mask existing LRD to the extent that its statistical removal revealed a longer window for LRD than was previously assumed.

  4. Caches at different levels of a cache hierarchy exhibit different levels of periodicity.
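
To make the interplay between periodicity and LRD concrete, the following is a minimal sketch under stated assumptions, not the analysis used to obtain the results above. It assumes the hit rate has already been sampled at regular intervals into a list of numbers. Periodicity is checked with the sample autocorrelation at a candidate lag, the periodic component is removed by subtracting the mean of each phase, and LRD is summarised by a Hurst parameter estimated with the aggregated variance (variance-time) method. All function names and the choice of block sizes are hypothetical.

```python
import math


def _ols_slope(xs, ys):
    """Slope of the least-squares line through the points (xs, ys)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


def autocorrelation(series, lag):
    """Sample autocorrelation of the series at a positive lag."""
    n = len(series)
    mean = sum(series) / n
    var = sum((x - mean) ** 2 for x in series)
    if var == 0:
        raise ValueError("series is constant")
    cov = sum((series[i] - mean) * (series[i + lag] - mean)
              for i in range(n - lag))
    return cov / var


def remove_periodic_component(series, period):
    """Subtract from each sample the mean of all samples at the same phase."""
    sums = [0.0] * period
    counts = [0] * period
    for i, x in enumerate(series):
        sums[i % period] += x
        counts[i % period] += 1
    phase_means = [s / c for s, c in zip(sums, counts)]
    return [x - phase_means[i % period] for i, x in enumerate(series)]


def hurst_aggregated_variance(series, block_sizes):
    """Estimate the Hurst parameter H from the variance-time plot.

    For block size m, the variance of the block means is taken to scale as
    m**(2H - 2), so H = 1 + slope / 2 on log-log axes.
    """
    xs, ys = [], []
    for m in block_sizes:
        k = len(series) // m  # number of complete blocks
        if k < 2:
            raise ValueError("block size %d leaves fewer than two blocks" % m)
        means = [sum(series[j * m:(j + 1) * m]) / m for j in range(k)]
        mu = sum(means) / k
        var = sum((v - mu) ** 2 for v in means) / k
        if var <= 0:
            raise ValueError("zero variance at block size %d" % m)
        xs.append(math.log(m))
        ys.append(math.log(var))
    return 1 + _ols_slope(xs, ys) / 2
```

With hourly hit rates, for example, one might look at autocorrelation(rates, 24) for a daily cycle, then compare hurst_aggregated_variance on the raw series with the value for remove_periodic_component(rates, 24). An estimate near 0.5 suggests no long range dependence, while values approaching 1 suggest strong LRD. The aggregated variance estimator is known to be rough, so it serves here only to show the shape of the calculation.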

 

Conclusions.

The length of LRD in Web traffic and the nature of popularity curves are both hotly discussed topics in Web traffic research. The work discussed here goes some way to clarifying details in both topics and may help with the further development of modeling tools such as SURGE. The Web Characterisation Activity will provide a much needed meeting ground for researchers with complementary interests and, hopefully, a secure repository for traffic logs, which are also much needed.

References.

[1] C Roadknight and I Marshall. Variations in cache behaviour. Computer Networks and ISDN Systems, 30 (1998), pp. 733-735.

[2] I Marshall and C Roadknight. Linking cache performance to user behaviour. In Proceedings of the 3W3Cache Workshop, Manchester, June 1998. http://wwwcache.ja.net/events/workshop/11/cacheusr_c.htm

[3] P Barford and M Crovella. Generating representative Web workloads for network and server performance evaluation. In Proceedings of the 1998 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems.

[4] G K Zipf. Human Behavior and the Principle of Least Effort. Addison-Wesley, Cambridge, MA, 1949.

[5] C A Cunha, A Bestavros and M E Crovella. Characteristics of WWW client-based traces. Technical Report TR-95-010, Boston University Department of Computer Science, April 1995.